Hello,
I read the article by @amn41 about supervised embeddings. It states:

> You often don’t have word vectors for some important words, […] With our standard approach [I assume that was spaCy at that time], you can’t learn vectors for these words, so they never carry any signal.
I can see how this was possible with the old pretrained_embeddings_spacy pipeline, which was defined as follows:
```yaml
language: "en"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
```
I wonder if this is still true for the currently suggested spaCy pipeline:
```yaml
pipeline:
- name: SpacyNLP
- name: SpacyTokenizer
- name: SpacyFeaturizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
```
Here, additional featurizers are present, like the CountVectorsFeaturizer, so the classifier should be able to take advantage of the features they generate. If a word is OOV for spaCy, it should nowadays (the original article is from 2018) still carry a signal for the prediction, as the sketch below illustrates.
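To check my understanding, here is a minimal sketch of why the character n-grams give OOV words a signal. It uses scikit-learn directly, not Rasa's internals, and the training sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer settings as the second CountVectorsFeaturizer above.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
vectorizer.fit(["restart the server", "reboot the machine"])

# "restarting" never appeared in training (and could also be OOV for
# spaCy), but it shares character n-grams like "rest" and "tart" with
# "restart", so its feature vector is non-zero.
features = vectorizer.transform(["restarting"])
print(features.nnz > 0)  # True -> the word still carries a signal
```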
Is my assumption correct?