Pre-load vectors for TensorFlow ranking classifier pipeline

@amn41 Hi, my question is about how to use the TensorFlow ranking classifier with pre-loaded word vectors from spaCy.

pipeline: 
- name: "tokenizer_whitespace"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_spacy"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"
- name: "ner_duckling_http"
  url: "http://0.0.0.0:8000"
  locale: "NL_Nothing"

Will this pipeline work? I understand that the TensorFlow classifier requires text features, so instead of using the count vector featurizer, I replaced it with spaCy's featurizer, which pre-loads the vectors.

Does this make sense?

I think what you are trying to do should work. I have tried this in the past and it sort of works. You can also combine the spaCy word vectors with your own (see pipeline below).

There are a few things that you should keep in mind:

  1. The TF embedding classifier's default parameters are chosen to work best with the CountVectorFeaturizer. Make sure to tweak the hyperparameters if you only use the spaCy vectors.
  2. When combining spaCy vectors with the CountVectorFeaturizer, make sure to put the CountVectorFeaturizer first in the pipeline. Otherwise it will not featurize the original, raw words but rather the lemmata generated by spaCy. This somewhat defeats the purpose of training your own vectors on your custom, domain-specific vocabulary.
  3. spaCy's word vectors have a length of ca. 350. The CountVectorFeaturizer typically produces much longer vectors, often a few thousand entries.
  4. In our testing, we did not see any improvement when combining spaCy + CountVectorFeaturizer + TF embedding classifier. On our dataset, F1 dropped by 3-4 percentage points. Your mileage may vary.
pipeline:
- name: "intent_featurizer_count_vectors"
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "intent_featurizer_spacy"
- name: "intent_classifier_tensorflow_embedding"
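
To make points 2 and 3 a bit more concrete, here is a rough Python sketch of what stacking two featurizers amounts to, as far as I understand it: each featurizer appends its vector to the features already present, so the classifier sees one long vector. This is not Rasa's actual code. The helper names, the toy vocabulary, and the random stand-in for the spaCy vector are all made up for illustration, and the dense length of 350 is just the figure from point 3.

```python
import random


def count_vector(tokens, vocabulary):
    """Bag-of-words counts: one slot per vocabulary word."""
    return [float(tokens.count(word)) for word in vocabulary]


def stand_in_dense_vector(text, dim):
    """Hypothetical placeholder for a pretrained sentence vector.

    A real pipeline would take this from spaCy; here we only need
    something with the right length, so we draw seeded random values.
    """
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def combined_features(text, vocabulary, dense_dim=350):
    """Concatenate count features (first) and dense features (second)."""
    tokens = text.lower().split()
    sparse_part = count_vector(tokens, vocabulary)
    dense_part = stand_in_dense_vector(text, dense_dim)
    return sparse_part + dense_part


# Toy vocabulary; a real one would have hundreds or thousands of entries.
vocabulary = ["book", "a", "flight", "hotel", "cancel", "my"]
features = combined_features("Book a flight", vocabulary)

# Total length is simply the sum of both parts.
print(len(features), len(vocabulary) + 350)
```

With a realistic vocabulary the count part would be the much longer of the two, and its values (small non-negative counts) live on a different scale than the dense part, which is one reason the default hyperparameters may need tuning.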

Interesting, thanks for the useful insight. I will test it out. We actually have our own custom spaCy vectors, which perform quite well with an SVM. However, given some of the advantages of the TensorFlow ranker with OOV words and multi-intents, it seemed worth trying to mix both and see the results.

Let me know the results and the parameters you used. I’d be interested in your findings.

I was looking for this exact thing. Has anyone figured out a standard way of doing this that works? It would be extremely helpful to us, since our dataset is quite small.