I wanted to check whether supervised_embeddings.yml outperforms pretrained_embeddings_spacy.yml for entity extraction, so I ran:
rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml --nlu CF_model/config_en.json --runs 3 --percentages 0 25 50 70 90
Is this approach OK, or will the results be the same for both configurations as far as entity extraction is concerned? The dataset barely uses intents (just two), and it contains financial entities.
2019-12-12 11:58:59 INFO rasa.nlu.model - Finished training component.
2019-12-12 11:58:59 INFO rasa.nlu.model - Starting to train component EntitySynonymMapper
2019-12-12 11:58:59 INFO rasa.nlu.model - Finished training component.
2019-12-12 11:58:59 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
2019-12-12 11:59:00 INFO rasa.nlu.model - Finished training component.
2019-12-12 11:59:00 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
Killed
Rasa version = 1.5.1
Training data stats:
intent examples: 11262 (2 distinct intents)
Found intents: 'irrelevant', 'general'
Number of response examples: 0 (0 distinct responses)
Data size (./CF_model/config_en.json): 13 MB in JSON format or 2.7 MB in MD format (I tried both).
Command:
rasa test nlu -u CF_model/config_en.json --config supervised_embeddings.yml --cross-validation
supervised_embeddings.yml
language: "en"
pipeline: "supervised_embeddings"
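Note that the shortcut string expands to a fixed list of components, so parameters of individual components (such as the CountVectorsFeaturizer) cannot be tuned from it. In Rasa 1.x, "supervised_embeddings" is roughly equivalent to the written-out pipeline below — a sketch based on the Rasa 1.x defaults, so double-check it against the documentation for your exact version:

```yaml
language: "en"
pipeline:
  - name: "WhitespaceTokenizer"
  - name: "RegexFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "EmbeddingIntentClassifier"
```

Writing the pipeline out like this is what makes it possible to add per-component settings later.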
Same problem when I run:
rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml --nlu CF_model/config_en.json --runs 3 --percentages 0 25 50 70 90
How much memory does your machine have? It looks like the vocabulary of the CountVectorsFeaturizer is getting too big for your machine. You can restrict the vocabulary size using the min_df and max_df parameters; see the Components documentation.
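For intuition on those two parameters: min_df drops rare tokens and max_df drops near-ubiquitous ones, which can shrink the vocabulary (and therefore memory use) dramatically on a large corpus. Here is a minimal pure-Python sketch of the pruning rule — the semantics mirror scikit-learn's CountVectorizer, which Rasa's CountVectorsFeaturizer builds on; the example documents are made up:

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Toy version of the min_df/max_df pruning applied by
    CountVectorsFeaturizer (via sklearn's CountVectorizer).

    min_df: keep a token only if it occurs in at least this many documents.
    max_df: keep a token only if it occurs in at most this fraction of documents.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter()
    for doc in docs:
        for token in set(doc.lower().split()):
            df[token] += 1
    max_count = max_df * n_docs
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

docs = [
    "transfer money to my savings account",
    "transfer funds to my checking account",
    "what is my account balance",
    "block my credit card please",
]

print(len(build_vocabulary(docs)))                        # → 15 (full vocabulary)
print(len(build_vocabulary(docs, min_df=2, max_df=0.8)))  # → 3  (pruned)
```

With min_df=2 and max_df=0.8, only tokens that appear in at least two documents but in no more than 80% of them survive ("my" appears in every document and is dropped), so the vocabulary collapses from 15 entries to 3.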