I’m trying to use custom or pretrained embeddings with ner_crf for entity extraction, but I can’t find a proper tutorial for it yet. I have tried using fastText with spaCy, but I don’t think the embeddings are being used by ner_crf (since I’m not using the POS-tag feature with ner_crf).
If I had to feed custom embeddings as an additional feature to ner_crf, how should I do it with or without spaCy (spaCy doesn’t have support for BERT embeddings yet)?
@Juste Yes, I did, and I have followed this GitHub issue to use fastText with Rasa. But going through the code, I see that spaCy is only used when POS features are enabled for ner_crf (and POS features aren’t included in the default params of ner_crf). I tried enabling pos_features in the config file as well, but did not see any improvement. So my questions are:
Assuming the embeddings I want are available in spaCy and I create a package following the instructions, does Rasa only use POS tags as features for ner_crf? (Going through the code, I couldn’t find where the actual word embeddings are directly used, so maybe I’m missing something.)
Let’s say spaCy doesn’t have the embeddings I want to use (or I have custom embeddings for every word or subword). How do I pass them as features to ner_crf?
Hi @gowtham1997, did you manage to figure this out? I’m having the same problem and I also went through the code, without being able to figure out where and how the actual word embeddings are used by ner_crf.
I have a work-in-progress PR to discuss how to pass these kinds of features to ner_crf/CRFEntityExtractor. This would then pair with another new component like SpacyVectorEntityFeaturizer that would pass the features along. That way if any new components for custom NER came along, it would be reusable.
If you have a SpacyFeaturizer whose component config specifies ner_feature_vectors: true, it should work. It will make token.vector available to CRFEntityExtractor for every token in the spacy.Doc.
This worked, and we started to chat with the bot. But since we don’t have any entity extraction yet, it is pretty limited. Can you help me with what I should do next? Should I wait for your solution to land on the master branch, or are there things I need to do beforehand?
You just need to replace pipeline: "pretrained_embeddings_spacy" with individual components. You can pick and choose, but if you want mostly spaCy-based components, you could do:
pipeline:
- name: 'SpacyNLP'
  model: 'your_model_name_here'
- name: 'SpacyTokenizer'
- name: 'SpacyFeaturizer'
  ner_feature_vectors: true  # this is the part that's new functionality
- name: 'CRFEntityExtractor'
- name: 'EmbeddingIntentClassifier'
This would use spaCy to tokenize, would create features for intents using the .vector attribute on the Doc, and would pass the .vector attribute of each token to the CRFEntityExtractor as (some of) the features used for custom entity extraction.
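To make the idea concrete, here is a minimal sketch of how dense token vectors can be turned into per-dimension features for a CRFsuite-style library (which accepts feature dicts mapping names to float weights). This is an illustration of the general technique, not Rasa's actual internals; the function name and the placeholder vectors are my own.

```python
# Illustrative sketch (not Rasa's implementation): flatten each
# dimension of a token's embedding into a named numeric feature,
# the {"name": float} dict format CRFsuite-style libraries accept.

def token_to_crf_features(token_text, vector, prefix="vec"):
    """Turn one token's embedding into a CRF feature dict."""
    features = {"word.lower": token_text.lower()}
    for i, value in enumerate(vector):
        features[f"{prefix}_{i}"] = float(value)
    return features

# Placeholder 4-dimensional "embeddings" standing in for token.vector.
tokens = ["Book", "a", "flight"]
vectors = [
    [0.1, -0.2, 0.3, 0.0],
    [0.0, 0.1, -0.1, 0.2],
    [0.5, 0.4, -0.3, 0.1],
]

# One feature dict per token; a CRF trainer would consume this list
# alongside the gold entity labels for the sentence.
sentence_features = [
    token_to_crf_features(t, v) for t, v in zip(tokens, vectors)
]
print(sentence_features[0]["word.lower"])  # book
print(sentence_features[0]["vec_0"])       # 0.1
```

In a real pipeline the placeholder lists would be each spaCy token's .vector, and the dicts would be merged with the usual sparse CRF features (casing, prefixes, suffixes, POS tags).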
Hi @Juste ,
Can Rasa use the idea from this paper (“Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond”) in place of pretrained embeddings in the DIET classifier architecture, instead of using GloVe, BERT, or ConveRT?
An early reply would be much appreciated…