I have asked a question in the comment section of your CBOW and Skip Gram YT tutorial about implementing similar solutions in RASA, and I believe Vincent replied to me suggesting to ask the question here on RASA Forums, so here goes.
So, I have a general-purpose language model in spaCy (for the Hungarian language) that contains dense word-embedding vectors. I would like to use these word2vec word embeddings (token.vector) in my RASA model for better accuracy, but I haven’t found much information about how one might do that. I haven’t checked the code in great detail yet, but I am not sure this is even possible without writing custom code for the RASA pipeline.
Question: If it is possible to use word2vec word embeddings in RASA Open Source, how can I do that?
It’s possible to use custom word2vec embeddings in Rasa, but it depends slightly on how you trained these. Did you use gensim?
Instead of using spaCy directly, I might recommend borrowing a tool from this repository. This library supports features for Hungarian but Rasa also has some extra options.
I maintain a library called rasa-nlu-examples that tries to support many non-English tools. For example, it supports FastText and Byte-Pair Embeddings, both of which offer pre-trained embeddings for Hungarian.
There are multi-language BERT embeddings that are supported via our LanguageModelFeaturizer. In particular, we got good feedback on LaBSE. I believe Hungarian is also supported here.
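As a rough sketch of what that could look like in config.yml, assuming a Rasa 2.x-style pipeline and the LaBSE weights published under the rasa/LaBSE name (the tokenizer choice here is my assumption, so adjust as needed):

```yaml
# config.yml fragment (sketch, not a tuned setup)
language: hu
pipeline:
  # LanguageModelFeaturizer needs a tokenizer earlier in the pipeline
  - name: WhitespaceTokenizer
  # dense multilingual features from LaBSE (assumed weights name: rasa/LaBSE)
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: rasa/LaBSE
```

The dense features produced here are picked up by whichever classifier comes later in the pipeline, so no custom code is needed for this route.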
Having mentioned these tools though, I would like to stress that the most important part of an assistant is the data, not the pipeline. I might focus on getting something demo-able first so that you can start collecting feedback from users. I do not speak Hungarian, but I can imagine that DIET and some simple CountVectors will go a long way when you’re starting out.
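A minimal starting point along those lines might look like the config below. The epoch count and n-gram range are illustrative placeholders, not values tuned for Hungarian:

```yaml
# config.yml (sketch; hyperparameters are illustrative only)
language: hu
pipeline:
  - name: WhitespaceTokenizer
  # word-level bag-of-words features
  - name: CountVectorsFeaturizer
  # character n-grams inside word boundaries, which can help with
  # inflected word forms and typos
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
```

Once this baseline is collecting real user messages, dense featurizers can be added in front of DIETClassifier and compared against it.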
If there are more specific issues that you’re concerned with, let me know! I’m working on educational material this quarter on non-English assistants, and if there are any specific blockers for you I’d love to understand them better. I’m also interested in hearing if there are tools that I should add to rasa-nlu-examples for Hungarian.
Thank you for your detailed answer Vincent, this is really helpful! I think I am good for now, you’ve recommended quite a few options in your response, so I don’t have any further specific questions.
Cool. Let me know your results though if you can share them.
I’m keen to hear if these embeddings/tools help with the Hungarian use-cases out there. We don’t have a lot of those so any learnings that you might be able to share would be very welcome.