I have a set of word vectors that I obtained by applying word2vec (via PySpark) to a large, unsupervised, domain-specific corpus.
Now, I know the options for the intent classifier include the SVM and the StarSpace-based model (and now the DIET architecture as well). But, studying the components, I saw that the “EmbeddingIntentClassifier” learns its word embeddings from the supervised training data provided in the nlu.md file, and that the spaCy pipeline uses the “SklearnIntentClassifier” as its final component.
I wonder, then, whether it makes sense, or whether it’s useless, to import my custom word vectors and still use the “EmbeddingIntentClassifier” (the SVM didn’t give results as good as StarSpace, both using my ‘in-house’ embeddings).
To insert my word vectors into the pipeline, I ran this in the shell:
python -m spacy init-model pt ./my_embedding/spacy.word2vec.model --vectors-loc ./my_embedding/embeddings.txt.gz
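To sanity-check that the conversion worked, I load the resulting model and spot-check similarities between a few domain terms. Below is the helper I use (the spaCy-loading part is commented out here because it needs the model on disk, and the word pair is just a placeholder; the toy call at the end only demonstrates the check itself):

```python
import numpy as np

def cosine(u, v):
    # plain cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# What I actually run locally (needs the converted model on disk):
# import spacy
# nlp = spacy.load("./my_embedding/spacy.word2vec.model")
# u = nlp.vocab["some_domain_term"].vector      # placeholder words,
# v = nlp.vocab["a_related_term"].vector        # not real vocabulary entries
# print(cosine(u, v))

# Toy call with made-up vectors, just to show the check:
print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707
```

If related domain terms come back with noticeably higher similarity than unrelated ones, the vectors survived the conversion.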
Then, in my pipeline, I declared the following components (notice that I also want to use spaCy’s Brazilian Portuguese tokenization):
language: "pt"

pipeline:
  - name: "SpacyNLP"
    model: "my_embedding/spacy.word2vec.model"
  - name: "SpacyTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "EmbeddingIntentClassifier"
    epochs: 200
    random_seed: 42
Does this make sense? Am I doing something wrong? I mean, the classifier trains and predicts fine, but I can’t see any substantial difference compared to when I don’t declare the custom word embeddings.
Before concluding that these in-house word vectors of mine don’t help much, I still have some doubts I would like to resolve first:
Can the “EmbeddingIntentClassifier” still take as features/covariates the word vectors I made available in “my_embedding/spacy.word2vec.model”, or are these vectors just ignored?
Does calling CountVectorsFeaturizer and then SpacyFeaturizer provide the desired effect of using both BoW features and word vectors as features?
Does the way I created “my_embedding/spacy.word2vec.model” preserve the other aspects of spaCy’s official pre-trained PT-BR tokenizer?
What is the recommended pipeline when importing custom, domain-specific word vectors?
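To make my second question more concrete: my mental model is that each featurizer appends its own features, so the classifier would end up seeing a concatenation of the sparse BoW counts and the dense spaCy vectors. Here is a toy sketch of that understanding (made-up numbers and dimensions, not Rasa’s actual internals):

```python
import numpy as np

# toy bag-of-words counts, as CountVectorsFeaturizer might produce
bow = np.array([0, 2, 1, 0, 1], dtype=float)        # vocabulary of 5 tokens

# toy dense sentence vector, as SpacyFeaturizer might produce
dense = np.array([0.12, -0.40, 0.33], dtype=float)  # 3-dim for illustration

# my understanding: both feature sets are concatenated before the classifier
features = np.concatenate([bow, dense])
print(features.shape)  # (8,)
```

If that picture is right, declaring both featurizers should indeed give the classifier access to both kinds of features; I just want to confirm it.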
Sorry for the bunch of questions… any help on this would be great!