spaCy and OOV-Tokens


I read the article by @amn41 about supervised embeddings. It states that:

You often don’t have word vectors for some important words,[…] With our standard approach [I assume that was spaCy at that time], you can’t learn vectors for these words, so they never carry any signal.

I can see how this applies to the old pretrained_embeddings_spacy pipeline, which was defined as follows:

language: "en"
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"

I wonder if this is still true for the currently suggested spaCy-Pipeline:

  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

This pipeline contains additional featurizers, such as the CountVectorsFeaturizer, so the classifier should be able to take advantage of the features they generate. So even if a word is OOV for spaCy, it should nowadays (the original article is from 2018) still carry a signal for the prediction.

Is my assumption correct?

Yes, that’s right! The CountVectorsFeaturizer also generates features for words that are outside the vocabulary of the spaCy model.
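
To illustrate why the char_wb variant in particular helps with unseen words, here is a minimal sketch in plain Python. It is not Rasa's implementation: the helper names `char_wb_ngrams` and `featurize` and the two training sentences are made up for this example, and the n-gram extraction only roughly mimics a character analyzer restricted to word boundaries with `min_ngram: 1` and `max_ngram: 4`.

```python
from collections import Counter


def char_wb_ngrams(text, min_n=1, max_n=4):
    """Return character n-grams taken inside word boundaries.

    Each word is padded with one space on both sides, so n-grams
    never span two words (a rough approximation of "char_wb").
    """
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


# Hypothetical training data, invented for this sketch.
training_sentences = ["book a flight", "cancel my booking"]

# Vocabularies learned from the training data only.
word_vocab = set()
ngram_vocab = set()
for sentence in training_sentences:
    word_vocab.update(sentence.lower().split())
    ngram_vocab.update(char_wb_ngrams(sentence))


def featurize(text):
    """Count word-level and char-n-gram features known from training."""
    words = Counter(w for w in text.lower().split() if w in word_vocab)
    ngrams = Counter(g for g in char_wb_ngrams(text) if g in ngram_vocab)
    return words, ngrams


# "rebooking" never appears as a whole word in the training data.
word_features, ngram_features = featurize("rebooking")
print("word-level features:", dict(word_features))
print("number of known char n-grams:", len(ngram_features))
```

The unseen word gets no word-level features, but it shares character n-grams such as "book" and "king" with the training data, so the classifier still receives a non-empty sparse feature vector for it.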