Choosing an NLU pipeline

I have training data with the following characteristics/stats:

  • intent examples: 11263 (2 distinct intents)
    • Found intents: 'general', 'irrelevant'
    • Number of response examples: 0 (0 distinct response)
    • entity examples: 9407 (22 distinct entities)
    • found entities: '', 'company', 'amount_price_target', 'analyst', 'financial_topic', 'financial_instrument', 'period', 'person', 'price_movement', 'hashtag', 'publication', 'ticker', 'amount', 'percent', 'number', 'media_type', 'location', 'rating_agency', 'event', 'exchange', 'product', 'sector'

I want to check whether supervised_embeddings.yml will outperform pretrained_embeddings_spacy.yml for entity extraction, so I run:

    rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml --nlu CF_model/config_en.json --runs 3 --percentages 0 25 50 70 90

Is this approach OK, or will the results be the same for both approaches as far as entity extraction is concerned? In this dataset we hardly use the intents (there are just two), and we rely on financial entities.

Welcome to the community, @igormis :tada:

ner_crf currently does not use any features from the intent classification part, so there shouldn't be any difference between the two.

tnx Tobias, however I get a memory error whenever I run the test command…

@igormis I just talked to one of our researchers and my answer was wrong :see_no_evil: Depending on the configuration of your CRF component, the features of previous components can affect the entity extraction (see NLU Training Data).
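
For example, with the CRFEntityExtractor in Rasa 1.x, the features parameter decides which word-level features the CRF sees, and the "pattern" feature comes from a RegexFeaturizer that runs earlier in the pipeline. A rough sketch of such a configuration (the feature lists below are the documented Rasa 1.x defaults, so double-check them against your version):

    pipeline:
      - name: "WhitespaceTokenizer"
      - name: "RegexFeaturizer"          # provides the "pattern" feature used by the CRF below
      - name: "CRFEntityExtractor"
        # feature lists for [previous word, current word, next word]
        features:
          - ["low", "title", "upper"]
          - ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3",
             "suffix2", "upper", "title", "digit", "pattern"]
          - ["low", "title", "upper"]

So whether the two pipelines give you the same entity extraction results depends on which of these features your CRF is configured to use.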


  • How much training data do you have?
  • Which Rasa version are you using?
  • What's the error message in detail?

The output is not very descriptive:

    2019-12-12 11:58:59 INFO rasa.nlu.model - Finished training component.
    2019-12-12 11:58:59 INFO rasa.nlu.model - Starting to train component EntitySynonymMapper
    2019-12-12 11:58:59 INFO rasa.nlu.model - Finished training component.
    2019-12-12 11:58:59 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
    2019-12-12 11:59:00 INFO rasa.nlu.model - Finished training component.
    2019-12-12 11:59:00 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
    Killed

  • Rasa version = 1.5.1

  • Training data stats:

    • intent examples: 11262 (2 distinct intents)
      • Found intents: 'irrelevant', 'general'
      • Number of response examples: 0 (0 distinct response)
      • entity examples: 9407 (22 distinct entities)
      • found entities: '', 'percent', 'financial_topic', 'financial_instrument', 'amount_price_target', 'media_type', 'product', 'period', 'event', 'sector', 'rating_agency', 'analyst', 'person', 'ticker', 'location', 'company', 'publication', 'amount', 'price_movement', 'number', 'exchange', 'hashtag'
  • Data size (./CF_model/config_en.json): 13 MB in JSON format or 2.7 MB in Markdown format (I tried with both)

  • Command: rasa test nlu -u CF_model/config_en.json --config supervised_embeddings.yml --cross-validation

  • supervised_embeddings.yml

    language: "en"
    pipeline: "supervised_embeddings"

  • Same problem when I run: rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml --nlu CF_model/config_en.json --runs 3 --percentages 0 25 50 70 90

  • pretrained_embeddings_spacy.yml

    language: "en"
    pipeline: "pretrained_embeddings_spacy"

How much memory does your machine have? It seems the vocabulary size for the CountVectorsFeaturizer is getting too big for your machine. You can restrict the size of the vocabulary using the min_df and max_df parameters (see Components).
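
To do that you need to write out the pipeline instead of using the "supervised_embeddings" shorthand. Your log shows the process getting killed while training the second CountVectorsFeaturizer, which in the template is the character n-gram one, so that's the most likely culprit. A rough sketch (the pipeline below is the Rasa 1.x supervised_embeddings template written out; the min_df / max_df / max_features values are only illustrative and should be tuned for your data):

    language: "en"
    pipeline:
      - name: "WhitespaceTokenizer"
      - name: "RegexFeaturizer"
      - name: "CRFEntityExtractor"
      - name: "EntitySynonymMapper"
      - name: "CountVectorsFeaturizer"
        min_df: 2            # drop tokens that appear in fewer than 2 examples
        max_df: 0.9          # drop tokens that appear in more than 90% of examples
      - name: "CountVectorsFeaturizer"
        analyzer: "char_wb"
        min_ngram: 1
        max_ngram: 4
        max_features: 50000  # hard cap on the character n-gram vocabulary (illustrative value)
      - name: "EmbeddingIntentClassifier"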

It is 16 GB of RAM, and only this process is memory-intensive.