Thanks for the reply!
That’s not quite what I meant (although an EmbeddingInitializer is an interesting feature idea), so let me explain the difference. I don’t mean fine-tuning on my labeled rasa training data; I mean pretraining on some larger corpus of my own. My goal would be to start with, say, the en_core_web_md vectors from spacy, which are great for generalization, but then also have word vectors trained on my domain-specific corpus. I’m open to these replacing the en_core_web_md vectors if it’s easy to use those vectors as a starting point for my domain-specific training, but I haven’t found a good walkthrough of how well the (gensim for retraining --> back to spacy) process works.
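For concreteness, here’s roughly the round trip I’m picturing (just a sketch, assuming gensim 4.x; the corpus and the output path are stand-ins, and I’m keeping 300 dims to match en_core_web_md):

```python
import spacy
from gensim.models import Word2Vec

# Start from the stock pipeline whose vectors I'd like to extend or replace
nlp = spacy.load("en_core_web_md")

# Stand-in for my unlabeled domain corpus: one pre-tokenized sentence per item
domain_sentences = [
    ["great", "battery", "life", "on", "this", "router"],
    ["the", "firmware", "update", "bricked", "my", "device"],
]

# Train domain-specific vectors with gensim (300 dims to match en_core_web_md)
w2v = Word2Vec(domain_sentences, vector_size=300, window=5, min_count=1, workers=4)

# Push the new vectors back into spaCy's vocab, overwriting words that collide
for word in w2v.wv.index_to_key:
    nlp.vocab.set_vector(word, w2v.wv[word])

nlp.to_disk("en_core_web_md_plus_domain")  # hypothetical output path
```

The part I’m unsure about is the last step: whether overwriting vectors in place like this plays nicely with the rest of the pretrained pipeline, or whether the hand-off is better done through something like `spacy init vectors`.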
I would also be open to adding a new component to the pipeline that adds more word vectors (my domain-specific ones) in addition to the
spacy ones. Or I’d consider having two
spacy models, though the simplest way to do that right now is quite hacky, and the memory requirements might not be worth it.
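To make the “additional word vectors” idea concrete, this is the kind of thing I’m imagining (purely a sketch; `domain_vectors.kv` is a made-up path to separately trained gensim vectors):

```python
import numpy as np
import spacy
from gensim.models import KeyedVectors

nlp = spacy.load("en_core_web_md")
domain_kv = KeyedVectors.load("domain_vectors.kv")  # hypothetical domain vectors

def combined_vector(word: str) -> np.ndarray:
    """Concatenate the general-purpose spaCy vector with the domain-specific one."""
    general = nlp.vocab.get_vector(word)  # zero vector if out-of-vocabulary
    domain = (
        domain_kv[word]
        if word in domain_kv
        else np.zeros(domain_kv.vector_size, dtype="float32")
    )
    return np.concatenate([general, domain])

print(combined_vector("router").shape)  # (300 + domain dims,)
```

A custom featurizer in the rasa pipeline could expose exactly this concatenation, so the downstream classifier sees both views of each word.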
Here’s an example to illustrate. Imagine I work for an online retail company that has a chatbot. I have a small set of labeled utterances for training rasa nlu. My pipeline uses
en_core_web_md to start. But I also have a big dataset of comments people have left on my products, which would be great for pretraining. I don’t want to simply create a new
spacy model based on only that, since then I’d lose all the training spacy did on its much larger general-purpose corpus.
Ideally I could extend en_core_web_md using a technique like w2v on my unlabeled comments dataset, but I haven’t seen examples of that in spacy, and it seems vulnerable to catastrophic forgetting. So perhaps I should train a separate w2v model and make those features also available to my model using something like a “second featurizer” in the pipeline.
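Here’s the “extend the existing vectors” variant I mean, where a gensim model is seeded with the en_core_web_md vectors before continuing training on my comments (again just a sketch, assuming gensim 4.x; this is exactly the step where I’d worry about the original vectors drifting):

```python
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_md")

# Stand-in for the tokenized domain corpus
domain_sentences = [
    ["great", "battery", "life", "on", "this", "router"],
    ["the", "firmware", "update", "bricked", "my", "device"],
]

# Fresh gensim model with the same dimensionality as the spaCy vectors
model = Word2Vec(vector_size=nlp.vocab.vectors.shape[1], min_count=1)
model.build_vocab(domain_sentences)

# Seed every word the domain vocab shares with en_core_web_md
for word, idx in model.wv.key_to_index.items():
    vec = nlp.vocab.get_vector(word)
    if vec.any():
        model.wv.vectors[idx] = vec

# Continue training on the domain corpus from that starting point
model.train(domain_sentences, total_examples=model.corpus_count, epochs=5)
```

Whether words that appear only rarely in my comments keep something close to their original vectors after this, or get dragged around, is exactly the catastrophic-forgetting question I can’t find a good answer to.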
This seems like a really common pattern (where the raw text data available to you far exceeds the labeled utterances and your domain is a little specific), so I’m wondering what a good solution might look like.