Custom word vectors and pipelines using them

Hey all,

I have a set of word vectors that I obtained by applying word2vec (via PySpark) to a large unsupervised, domain-specific corpus.

Now, I know that the options for the intent classifier include the SVM and the StarSpace-based model (and now we also have the DIET architecture). Studying the components, I saw that the “EmbeddingIntentClassifier” component learns its word vectors from the supervised examples in the training file, and that the spaCy pipeline uses the “SklearnIntentClassifier” as its final component.

I wonder, then, whether it makes sense (or is useless) to import my custom word vectors and also use the “EmbeddingIntentClassifier” (the SVM didn’t give results as good as StarSpace, both using my ‘in-house’ embeddings).

To insert my word vectors into the pipeline, I ran this in the shell:

  python -m spacy init-model pt ./my_embedding/spacy.word2vec.model --vectors-loc ./my_embedding/embeddings.txt.gz
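A quick way to sanity-check that the vectors actually got attached to the model is to look up a few domain words and see how many have a vector (a rough sketch; the helper name and the sample words are mine, just for illustration):

```python
# Rough sanity check (helper name is mine, not part of spaCy or Rasa):
# what fraction of some domain words actually got a vector in the model?
def vector_coverage(nlp, words):
    """Return the fraction of `words` that have a vector in nlp's vocab."""
    hits = sum(1 for w in words if nlp.vocab[w].has_vector)
    return hits / len(words)

# Usage (assumes spaCy is installed; path is the one from the command above,
# and the words are placeholders for your own domain vocabulary):
#   import spacy
#   nlp = spacy.load("my_embedding/spacy.word2vec.model")
#   print(vector_coverage(nlp, ["conta", "saldo", "fatura"]))
```

If coverage is near zero, the vectors probably weren’t linked correctly.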

Then, in my pipeline, I declared the following components (note that I also want to keep spaCy’s Brazilian Portuguese tokenization):

language: "pt"
pipeline:
  - name: "SpacyNLP"
    model: "my_embedding/spacy.word2vec.model"
  - name: "SpacyTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "EmbeddingIntentClassifier"
    epochs: 200
    random_seed: 42

Does this make sense? Am I doing something wrong? The classifier trains and predicts fine, but I can’t see a substantial difference compared to when I don’t declare the custom word embeddings.

Before concluding that these in-house word vectors of mine don’t help much, I still have some questions I’d like to answer first:

  • Can “EmbeddingIntentClassifier” still take as features the word vectors I made available in “my_embedding/spacy.word2vec.model”, or are these vectors just ignored?

  • Does calling CountVectorsFeaturizer and then SpacyFeaturizer have the desired effect of using both BoW features and word vectors as features?

  • Does the way I created “my_embedding/spacy.word2vec.model” preserve the other aspects of spaCy’s official pre-trained PT-BR tokenizer?

  • What is the recommended pipeline when importing custom, domain-specific word vectors?

Sorry for the bunch of questions… any help on this would be great!


If you linked your word vectors to spaCy correctly, then in versions greater than 1.7, SpacyFeaturizer should provide additional dense features based on your custom word vectors to EmbeddingIntentClassifier.
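In other words, the classifier then sees the sparse count features and the dense spaCy vectors side by side. A toy sketch of the idea (plain Python with made-up numbers, not Rasa’s actual internals):

```python
# Toy illustration (not Rasa internals): sparse bag-of-words counts from
# CountVectorsFeaturizer and a dense sentence vector from SpacyFeaturizer
# are both made available to the intent classifier.
bow_features = [1.0, 0.0, 2.0, 0.0, 1.0]   # 5-dim count vector (toy values)
dense_features = [0.12, -0.40, 0.33]       # 3-dim spaCy vector (toy values)

# The classifier effectively works over the combined feature space:
combined = bow_features + dense_features
print(len(combined))  # 8
```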

Does the way I described for linking the word vectors, with ‘init-model’ and the ‘--vectors-loc’ parameter, seem OK?

Is there a better way or do you have a template script for doing this?


It’s correct

Do you know if I should declare more items in the following part if I want to preserve the Brazilian Portuguese tokenization, overriding the language model only for the embeddings?

  - name: "SpacyNLP"
    model: "my_embedding/spacy.word2vec.model"

Like… am I preserving the tokens usually parsed by spaCy with the pipeline shown above? I ask because my ‘???’ tokens are disappearing, while vanilla spaCy does preserve them as tokens.

You can hack into SpacyTokenizer and print the tokens to see whether they are what you want.
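A minimal sketch of that check outside Rasa (the helper name is mine; it assumes spaCy is installed and uses the custom model path from earlier in the thread):

```python
def token_texts(nlp, text):
    """Return the token strings the pipeline produces for `text`."""
    return [tok.text for tok in nlp(text)]

# Usage (assumes spaCy is installed):
#   import spacy
#   nlp = spacy.load("my_embedding/spacy.word2vec.model")
#   print(token_texts(nlp, "Isso é mesmo necessário???"))
#   # compare against the official PT model to see if '???' survives:
#   print(token_texts(spacy.load("pt_core_news_sm"), "Isso é mesmo necessário???"))
```

If the two outputs differ on your ‘???’ examples, the custom model’s tokenizer is the culprit.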