Dense word embeddings with Rasa (spaCy)

Hi there!

I asked a question in the comment section of your CBOW and Skip-Gram YouTube tutorial about implementing similar solutions in Rasa, and I believe Vincent replied suggesting that I ask the question here on the Rasa Forums, so here goes.

So, I have a general-purpose language model in spaCy (for Hungarian) that contains dense word-embedding vectors. I would like to use these word2vec embeddings (token.vector) in my Rasa model for better accuracy, but I haven’t found much information about how one might do that. I haven’t checked the code in great detail yet, but I am not sure this is even possible without writing custom code for the Rasa pipeline.

Question: if it is possible to use word2vec word embeddings in Rasa Open Source, how can I do that?

Thank you!

Aron

It’s possible to use custom word2vec embeddings in Rasa, but it depends slightly on how you trained them. Did you use gensim?
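To answer the direct question first: Rasa ships spaCy components, so if your Hungarian spaCy model is installed where Rasa can load it, a pipeline along these lines should pick up the model’s dense vectors. This is a minimal sketch; the model name below is just a placeholder for whichever Hungarian model you have installed, and the downstream components are one common choice, not the only one.

```yaml
# config.yml -- minimal sketch; "hu_model" is a placeholder for the
# installed/linked name of your Hungarian spaCy model
language: hu
pipeline:
  - name: SpacyNLP            # loads the spaCy model
    model: "hu_model"
  - name: SpacyTokenizer
  - name: SpacyFeaturizer     # passes the token.vector values on as dense features
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
```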

Instead of using spaCy directly, I might recommend borrowing a tool from this repository. That library supports featurizers for Hungarian, and Rasa itself has some extra options too.

  • I maintain a library called rasa-nlu-examples that tries to support many non-English tools. For example, it supports FastText and Byte-Pair embeddings, both of which offer pre-trained embeddings for Hungarian.
  • There are multi-language BERT embeddings that are supported via our LanguageModelFeaturizer. In particular, we’ve gotten good feedback on LaBSE. I believe Hungarian is also supported there.
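As a sketch, a pipeline using the Byte-Pair embeddings from rasa-nlu-examples might look like this. The `vs` (vocabulary size) and `dim` values below are illustrative; check the project docs for the sizes actually published for Hungarian.

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: hu       # language code for the pre-trained subword embeddings
    vs: 10000      # vocabulary size (illustrative value)
    dim: 100       # embedding dimension (illustrative value)
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
```

For the LaBSE route, you would swap the featurizer for the LanguageModelFeaturizer with `model_name: "bert"` and `model_weights: "rasa/LaBSE"`.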

Having mentioned these tools though, I would like to stress that the most important part of an assistant is the data, not the pipeline. I might focus on getting something demo-able first so that you can start collecting feedback from users. I do not speak Hungarian, but I can imagine that DIET and some simple CountVectors will go a long way when you’re starting out.
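Concretely, a starting pipeline along those lines can be quite small. The character n-gram settings below are a common trick for morphologically rich languages, so they may help for Hungarian, but treat the exact values as a starting point rather than a recommendation:

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer     # word-level counts
  - name: CountVectorsFeaturizer
    analyzer: char_wb                # character n-grams within word boundaries
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
```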

If there are more specific issues that you’re concerned with, let me know! I’m working on educational material this quarter on Non-English and if there are any specific blockers for you I’d love to understand that some more. I’m also interested in hearing if there are tools that I should add to Rasa-NLU-Examples for Hungarian.

Thank you for your detailed answer Vincent, this is really helpful! I think I am good for now, you’ve recommended quite a few options in your response, so I don’t have any further specific questions.


Cool. Let me know your results though if you can share them.

I’m keen to hear if these embeddings/tools help with the Hungarian use-cases out there. We don’t have a lot of those so any learnings that you might be able to share would be very welcome.

Also, a detail, odds are that I’ll speak at this conference about Hungarian NLP tools that I’ve helped make available.

It’s not happening for a while, but it might be of interest and a nice place to meet up virtually.