Foreign language (not English) problem

ggabor · October 1, 2019, 8:17pm

I’d like to develop a bot in Hungarian. spaCy doesn’t have a Hungarian model yet, but I was able to download pretrained vectors from fastText. How can I make my chatbot work with these vectors? How do I import them?

lahsuk · October 2, 2019, 4:57am

One of the good things about the embedding classifier that rasa-nlu uses is that it isn’t built on an assumption of the language being used. This means that you could use it with any language or combination of languages that you want and still get good results with it.

This feature is only available in the supervised embedding pipeline.

You could simply replace pretrained_embeddings_spacy with supervised_embeddings in your config file and it should work. If the language you’re working on cannot be tokenized by whitespace, you may want to implement a custom tokenizer. One downside of this approach is that you won’t get the NER capabilities that spaCy provides (though if spaCy provided them for your language, you wouldn’t be asking this question).
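To get a feel for whether whitespace tokenization is good enough for your language, here is a rough sketch of what it amounts to. This is not Rasa's actual tokenizer; the function name whitespace_tokenize is made up for illustration, and the sample sentence is just an example.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int]]:
    """Split text on whitespace and return (token, start_offset) pairs.

    Hypothetical helper for illustration only, not part of Rasa.
    """
    # Every run of non-whitespace characters becomes one token.
    return [(match.group(), match.start()) for match in re.finditer(r"\S+", text)]


if __name__ == "__main__":
    # Hungarian separates words with spaces, so this split is already usable.
    for token, start in whitespace_tokenize("Szeretnék pizzát rendelni"):
        print(start, token)
```

If the tokens that come out look like sensible words for your language, the default tokenizer is probably fine and you don't need a custom one.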

Hope that helps.

ggabor · October 2, 2019, 9:05am

Could you please elaborate on the details a bit? If I were using a spaCy package, I could download it like this:

python -m spacy download en

This would automatically download and install the English model and link it to the “en” shortcut that Rasa uses. What do I have to do with my fastText Hungarian vectors to be able to use them the same way in Python, just by referencing “hu”?

I haven’t found any source on this but if you can give me a link where I can read up on the details (not the basic theory of how it should work), that works too. Thanks.

lahsuk · October 2, 2019, 10:08am

spaCy has a built-in option to convert fastText vectors to spaCy vectors. Details here: https://spacy.io/usage/vectors-similarity#converting

After you convert the vectors, you can’t really load them just by specifying the language, as in language: hu. (Maybe you can load them with language: "/tmp/la_vectors_wiki_lg". Maybe? I haven’t tried this yet.) You need to build the package and install the language separately. (I forgot how I did it.) Instead of going through all this trouble, you can subclass the SpacyNLP component and override its create / load methods to load your model.
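If you want to see what is actually inside the fastText file before converting it, the .vec format is plain text: a header line with the word count and the dimension, then one word per line followed by its values. Below is a minimal sketch of a reader, assuming that layout. The names read_vec and read_vec_gz are hypothetical, and the two sample rows are made-up numbers, not real Hungarian vectors.

```python
import gzip
import io
from typing import Iterator, List, TextIO, Tuple


def read_vec(handle: TextIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a fastText .vec text stream."""
    header = handle.readline().split()
    dim = int(header[1])  # header line is "<word count> <dimension>"
    for line in handle:
        parts = line.rstrip().split(" ")
        word = parts[0]
        values = [float(value) for value in parts[1:] if value]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        yield word, values


def read_vec_gz(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Same as read_vec, but for a gzipped file such as cc.hu.300.vec.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from read_vec(handle)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real .vec file (values are invented).
    sample = io.StringIO("2 3\nkutya 0.1 0.2 0.3\nmacska 0.4 0.5 0.6\n")
    vectors = dict(read_vec(sample))
    print(len(vectors), vectors["kutya"])
```

With spaCy installed, each pair could presumably be added to a blank pipeline through nlp.vocab.set_vector and the result saved with nlp.to_disk, but I haven't verified that route end to end, so treat it as an assumption and prefer the converter from the link above.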

Hope that helps.

ggabor · October 6, 2019, 12:22pm

Could you tell me how exactly to override SpacyNLP?

I see in the Components doc that it should go somewhat like this:

pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"

  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `false`. For some
  # applications and models it makes sense to differentiate
  # between these two words, therefore setting this to `true`.
  case_sensitive: false

But what is that model? Shouldn’t it have a format or something? The file I ended up with after converting from fastText to spaCy is called cc.hu.300.vec.gz. What folder should the file be in for the program to be able to load it? Where do I declare the language hu? Or is that not necessary? Should this be the first pipeline? Or are several pipelines even allowed in the config.yml file? Or should this replace the pipeline that’s already in the file, which is:

language: hu
pipeline: supervised_embeddings

It would be nice to have more detailed docs…

ggabor · October 10, 2019, 6:08pm

@amn41 could you give me an answer to this please? Thanks.

koaning · January 28, 2021, 2:54pm

I recognize that I’m late to the question here but FastText is now available directly in Rasa via the rasa-nlu-examples project that I maintain.

AskMeAnything[tm].
