spaCy and OOV Tokens

Hello,

I read the article by @amn41 about supervised embeddings, which states:

You often don’t have word vectors for some important words, […] With our standard approach [I assume that was spaCy at that time], you can’t learn vectors for these words, so they never carry any signal.

While I can see how this is possible with the old pretrained_embeddings_spacy pipeline, which was defined as follows:

language: "en"
pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "RegexFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "SklearnIntentClassifier"

I wonder if this is still true for the currently suggested spaCy pipeline:

pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

Additional featurizers, such as the CountVectorsFeaturizer, are present here, so the classifier should be able to take advantage of the features they generate. If a word is OOV for spaCy, it should nowadays (the original article is from 2018) still carry a signal for the prediction, as the sketch below illustrates.
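
For illustration, a minimal sketch of that effect using sklearn's CountVectorizer (which the CountVectorsFeaturizer is based on) with the same char_wb settings as above; the training texts and the unseen word "flighty" are made up for the example:

from sklearn.feature_extraction.text import CountVectorizer

# same settings as the char_wb CountVectorsFeaturizer above
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
vectorizer.fit(["book a flight", "cancel my booking"])

# "flighty" never appeared in the training data, but it shares
# character n-grams ("fli", "ligh", ...) with "flight", so it
# still produces non-zero features
features = vectorizer.transform(["flighty"])
print(features.nnz)  # > 0: the unseen word still carries a signal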

Is my assumption correct?

Yes, that’s right! The CountVectorsFeaturizer will also generate features for words that are out of the vocabulary of the spaCy model :slight_smile:
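
As a side note, the word-level CountVectorsFeaturizer also has OOV_token and OOV_words options, so words unseen at training time can be mapped to a shared out-of-vocabulary token at prediction time. A sketch (the OOV_words entries are made-up placeholders):

pipeline:
  - name: CountVectorsFeaturizer
    OOV_token: "oov"
    OOV_words: ["unknownword", "anotherunknownword"]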