Custom word vectors and pipelines using them

Hey all,

I have a set of word vectors that I obtained by applying word2vec in PySpark to a large, unsupervised, domain-specific corpus.

Now, I do know that we have the SVM and the StarSpace-based model as options for the intent classifier (and now also the DIET architecture), but while studying the components I saw that the “EmbeddingIntentClassifier” component learns its word embeddings from the supervised data provided in the nlu.md file, and that the spaCy pipeline uses the “SklearnIntentClassifier” as its final component.

I wonder, then, whether it makes sense or is useless to import my custom word vectors and still use the “EmbeddingIntentClassifier” (the SVM didn’t provide results as good as StarSpace, both using my ‘in-house’ embeddings).

To insert my word vectors into the pipeline, I ran this in the shell:

python -m spacy init-model pt ./my_embedding/spacy.word2vec.model --vectors-loc ./my_embedding/embeddings.txt.gz
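
As a sanity check that the vectors were actually imported (the example word is arbitrary), I can load the resulting model in Python:

import spacy

# Load the model produced by init-model and confirm the vectors are there
nlp = spacy.load("./my_embedding/spacy.word2vec.model")
print(nlp.vocab.vectors.shape)  # (number of vectors, vector dimension)

token = nlp("pedido")[0]  # any in-vocabulary word
print(token.has_vector, token.vector[:5])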

Then, in my pipeline, I called the following components (notice that I also want to use the Brazilian Portuguese tokenization from spaCy):

language: "pt"
pipeline:
  - name: "SpacyNLP"
    model: "my_embedding/spacy.word2vec.model"
  - name: "SpacyTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "EmbeddingIntentClassifier"
    epochs: 200
    random_seed: 42

Does it make sense? Am I doing something wrong? I mean, the classifier trains and predicts fine, but I can’t see a substantial difference compared to when I don’t declare the custom word embeddings.
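
To quantify the difference, I guess I could cross-validate the same NLU data against two configs, one with and one without the spaCy components (the config file names below are just placeholders):

rasa test nlu -u data/nlu.md --config config_with_vectors.yml --cross-validation
rasa test nlu -u data/nlu.md --config config_bow_only.yml --cross-validation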

Before concluding that these in-house word vectors of mine don’t help much, I still have some doubts I would like to resolve first:

  • Can “EmbeddingIntentClassifier” still take as features/covariates the word vectors that I made available in “my_embedding/spacy.word2vec.model”, or are these vectors just ignored?

  • Does calling CountVectorsFeaturizer and then SpacyFeaturizer provide the desired effect of using both BoW features and word vectors as features?

  • Does the way I created “my_embedding/spacy.word2vec.model” preserve the other aspects of the spaCy tokenizer from the official pre-trained PT-BR model?

  • What is the most recommended pipeline when importing customized, domain-specific word vectors?

Sorry for the bunch of questions… any help on this would be great!


If you linked your word vectors to spaCy correctly, then SpacyFeaturizer should provide additional dense features based on your custom word vectors to the EmbeddingIntentClassifier in Rasa versions greater than 1.7.

Does the way I described for linking the word vectors, with the ‘init-model’ command and the ‘vectors-loc’ parameter, seem OK?

Is there a better way or do you have a template script for doing this?

Thanks!

It’s correct.
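
As a side note, spaCy 2.x also lets you register the initialized model under a shortcut name with ‘spacy link’ (the name ‘pt_custom’ below is just an example), so the pipeline’s model parameter can point to the name instead of the path:

python -m spacy link ./my_embedding/spacy.word2vec.model pt_custom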

Do you know if I should declare more items in the following part if I want to preserve the Brazilian Portuguese tokenization, overriding the language model only for the embeddings?

  - name: "SpacyNLP"
    model: "my_embedding/spacy.word2vec.model"

Like… am I preserving the tokens usually parsed by spaCy with the presented pipeline? I ask because my ‘???’ are disappearing, while vanilla spaCy does preserve them as tokens.

You can hack into SpacyTokenizer and print the tokens, to see whether they are what you expect.
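
For example, since SpacyTokenizer essentially wraps the tokens of the spaCy doc, a quick check is to tokenize the same text with the official Portuguese model and with your custom one (‘pt_core_news_sm’ and the sample sentence are only for illustration):

import spacy

# Official pre-trained PT model vs. the custom-vectors model from init-model
vanilla = spacy.load("pt_core_news_sm")
custom = spacy.load("./my_embedding/spacy.word2vec.model")

text = "Olá, tudo bem???"
print([t.text for t in vanilla(text)])
print([t.text for t in custom(text)])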