Easiest way to finetune word vectors

I have been using spacy (en_core_web_md) in a pipeline, but I’d like to fine-tune its word vectors on my own (non-rasa-training) data. I know there are many options for this, but I’d love thoughts on how to best accomplish it.

Requirements:

  • I don’t want brand new word vectors, I want to extend the existing spacy ones or a similarly sized set with similar word coverage
  • I don’t want it to be much larger (don’t want to start with a huge .vec unless there’s good guidance on pruning)

Thoughts:

  • the spacy documentation suggests using gensim to run w2v, but I would want to confirm others have had success extending word vectors in this way
  • I’d be happy to add a custom GensimFeaturizer to keep my domain-specific vectors separate, though then I’d likely eat up some extra memory and my domain-specific vectors might be less generalized since they would be trained on a smaller corpus

I’m curious if anyone has experimented with any of these routes and has opinions!

Hi @jamesmf, if I understand you correctly, you plan to add a featurizer that loads the pre-trained vectors (which is currently possible with SpacyFeaturizer), but you also want to fine-tune them? If you only add a featurizer and the word-vector embedding weights are not carried over to the intent classifier, they will not get fine-tuned, because they will not be part of the tensorflow graph. So, in order to fine-tune, you would have to do the featurization inside your tensorflow graph in EmbeddingIntentClassifier.

Thanks for the reply!

That’s not quite what I meant (although an EmbeddingInitializer is an interesting feature idea). Here’s what is different.

I don’t mean fine-tuning on my labeled rasa training data. I mean pretraining on some larger corpus of my own. So my goal would be to start with, say, the en_core_web_md vectors from spacy, which are great for generalization. But then I’d also like to have word vectors that come from my other, domain-specific corpus. I’m open to these replacing the en_core_web_md vectors if it’s easy to use those vectors as a starting point for my domain-specific training. But I haven’t found a great walkthrough on how well the (spacy --> gensim for retraining --> back to spacy) process works.
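For the first and last hop of that round trip, one candidate interchange format is the plain-text word2vec format (a header line with the vocabulary size and dimension, then one word and its values per line), which gensim's `KeyedVectors.load_word2vec_format` can read. Below is a minimal sketch of writing and reading that format in plain Python. The function names and the toy vectors are made up for illustration, and pulling the real vectors out of spacy (or loading the retrained ones back in) is not shown.

```python
import io
from typing import Dict, List, TextIO


def write_word2vec_text(vectors: Dict[str, List[float]], out: TextIO) -> None:
    """Write 'count dim' as a header, then one 'word v1 v2 ...' line per word."""
    # Keys containing whitespace cannot be represented in this
    # space-delimited format, so they are skipped here.
    rows = {w: v for w, v in vectors.items()
            if w and not any(ch.isspace() for ch in w)}
    dim = len(next(iter(rows.values()))) if rows else 0
    out.write(f"{len(rows)} {dim}\n")
    for word, vec in rows.items():
        out.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")


def read_word2vec_text(src: TextIO) -> Dict[str, List[float]]:
    """Read the format written above back into a word -> vector dict."""
    count, dim = (int(n) for n in src.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in src:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got line: {line!r}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError("header count does not match number of rows")
    return vectors


# Toy 3-d vectors with made-up values; real spacy md vectors are much larger.
toy = {
    "shoes": [0.1, -0.2, 0.3],
    "refund": [0.0, 0.5, -1.0],
    "gift card": [1.0, 1.0, 1.0],  # contains a space, so it is dropped on export
}

buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
round_tripped = read_word2vec_text(buffer)
```

Whether continued training from such a file actually preserves the general-purpose quality of the original vectors is exactly the open question here; the sketch only covers the file format.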

I would also be open to adding a new component to the pipeline that adds more word vectors (my domain-specific ones) in addition to the spacy ones. Or I’d consider having two spacy models, though the simplest way to do that right now is quite hacky, and the memory requirements might not be worthwhile.

Here’s an example to illustrate. Imagine I work for an online retail company that has a chatbot. I have a small set of labeled utterances for training rasa nlu. My pipeline uses spacy en_core_web_md to start. But I also have a big dataset of comments people have left on my products, which would be great for pretraining. I don’t want to simply create a new spacy model based on only that, since then I’d lose all the training spacy did on the OntoNotes corpus.

Ideally I could extend en_core_web_md using a technique like w2v on my unlabeled comments dataset, but I haven’t seen examples of that in spacy, and it seems vulnerable to catastrophic forgetting. So perhaps I should train a separate w2v model and make those features also available to my model using something like a “GensimFeaturizer”.
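To make that second option concrete, here is a minimal sketch, assuming both vector sets are available as plain word-to-vector lookups. The class name `ConcatFeaturizer`, the whitespace tokenizer, and the toy vectors are all hypothetical, and this is not rasa's component API. It only shows the idea: mean-pool each table separately and concatenate the results, so the general-purpose vectors are never modified and cannot be "forgotten".

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of the tokens found in `table`; zeros if none are found."""
    found = [table[t] for t in tokens if t in table]
    if not found:
        return [0.0] * dim
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


class ConcatFeaturizer:
    """Hypothetical featurizer: general features followed by domain features."""

    def __init__(self, general: Dict[str, Vector], domain: Dict[str, Vector],
                 general_dim: int, domain_dim: int) -> None:
        self.general = general
        self.domain = domain
        self.general_dim = general_dim
        self.domain_dim = domain_dim

    def featurize(self, text: str) -> Vector:
        # Naive whitespace tokenizer, for illustration only.
        tokens = text.lower().split()
        return (mean_pool(tokens, self.general, self.general_dim)
                + mean_pool(tokens, self.domain, self.domain_dim))


# Toy 2-d "general" and 3-d "domain" vectors with made-up values.
general = {"return": [1.0, 0.0], "shoes": [0.0, 1.0]}
domain = {"shoes": [0.5, 0.5, 0.0], "sku": [0.0, 0.0, 1.0]}

featurizer = ConcatFeaturizer(general, domain, general_dim=2, domain_dim=3)
features = featurizer.featurize("return shoes sku")
# features has 2 + 3 = 5 entries: general part first, domain part second
```

The trade-off is the one already mentioned: both tables have to sit in memory, and the domain part is only as good as the smaller corpus it was trained on.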

This seems like a really common pattern (where the raw text data available to you far exceeds the labeled utterances and your domain is a little specific), so I’m wondering what a good solution might look like.
