Will this pipeline work? I understand that TensorFlow requires text features, so instead of using the count vector featurizer, I replaced it with spaCy's, which pre-loads the vectors.
I think what you are trying to do should work. I have tried this in the past and it sort of works. You can combine the spaCy word vectors with your own (see pipeline below).
There are a few things that you should keep in mind:
The TF embedding classifier's default parameters are chosen to work best with the CountVectorFeaturizer. Make sure to tweak the hyperparameters if you only use the spaCy vectors.
When combining spaCy vectors with the CountVectorFeaturizer, make sure to put the CountVectorFeaturizer first in the pipeline. Otherwise it will featurize the lemmas generated by spaCy rather than the original, raw words, which partly defeats the purpose of training your own vectors on your custom, domain-specific vocabulary.
spaCy's word vectors have a length of ca. 350. The CountVectorFeaturizer typically produces much longer vectors, usually a few thousand entries in length.
In our testing, we did not see any improvements when combining spaCy + CountVectorFeaturizer + TF embedding classifier. On our dataset, F1 dropped by 3-4 percentage points. Your mileage may vary.
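To make the size mismatch in the points above concrete, here is a rough sketch in plain Python (not Rasa code) of what combining the two featurizers amounts to: a bag-of-words count vector over the training vocabulary is concatenated with a dense sentence vector. The helper names, the toy vocabulary, and the placeholder dense vector are all made up for illustration. The dense length of 350 only mirrors the figure mentioned above.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Bag-of-words counts over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def combine_features(sparse_part, dense_part):
    """Concatenate both feature vectors into one classifier input."""
    return list(sparse_part) + list(dense_part)


# Toy vocabulary; a real one typically has a few thousand entries.
vocabulary = ["book", "a", "flight", "to", "berlin", "cancel", "my", "order"]

# Placeholder standing in for a pre-trained sentence vector (assumed length 350).
dense = [0.0] * 350

sparse = count_vector("Book a flight to Berlin", vocabulary)
features = combine_features(sparse, dense)

# Lengths of the count part, the dense part, and the combined vector.
print(len(sparse), len(dense), len(features))
```

With a realistic vocabulary the count part would be several thousand entries wide, so it would make up most of the combined vector. That is worth keeping in mind when tuning the classifier's hyperparameters.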
Interesting, thanks for the useful insight. I will test it out. We actually have our own custom spaCy vectors, which perform quite well with the SVM. However, given some of the advantages of the TensorFlow ranker with OOV words and multi-intents, it seemed worth trying to mix both and see the results.
I was looking for this exact thing. Has anyone figured out a standard way of doing this that works? It would be extremely helpful to us, since our dataset is quite small.