Foreign language (not English) problem

I’d like to develop a bot in Hungarian. SpaCy doesn’t have Hungarian language support yet, but I could download a trained model from fastText. How can I make my chatbot work based on this dataset? How do I import it?

One of the good things about the embedding classifier that rasa-nlu uses is that it isn’t built on an assumption of the language being used. This means that you could use it with any language or combination of languages that you want and still get good results with it.

This feature is only available in the supervised embedding pipeline.

You could simply replace pretrained_embeddings_spacy with supervised_embeddings in your config file and it should work. If the language you’re working with cannot be tokenized by whitespace, you may want to implement a custom tokenizer. One downside of this method is that you won’t get the NER capabilities that spaCy provides (though if spaCy supported your language, you wouldn’t be asking this question).

Hope that helps.

Could you please elaborate on the details a bit? If I were using a package from spaCy, I could download it like this:

python -m spacy download en

This would automatically download and install the English data set and attach it to the “en” abbreviation in Rasa. What do I have to do with my fastText Hungarian dataset to be able to use it the same way in Python, just by referencing “hu”?

I haven’t found any source on this but if you can give me a link where I can read up on the details (not the basic theory of how it should work), that works too. Thanks.

Spacy has a built-in option to convert fasttext vectors to spacy vectors. Details here: https://spacy.io/usage/vectors-similarity#converting
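For reference, here is a rough sketch of what that conversion does, in case it helps to see it in Python. It reads fastText's text .vec format (a header line with the row count and dimension, then one word and its values per line) and copies the vectors into a blank spaCy pipeline. The function names and the output directory are made up for illustration, and the second function assumes spaCy and NumPy are installed; I have not run this against the full Hungarian file.

```python
import gzip
from typing import Dict, IO, List


def read_vec(stream: IO[str]) -> Dict[str, List[float]]:
    """Parse fastText's text .vec format from an open text stream."""
    header = stream.readline().split()
    dim = int(header[1])  # header is "<row count> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        # Split from the right so exactly `dim` numbers are peeled off the end.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def save_as_spacy_model(vectors: Dict[str, List[float]], output_dir: str,
                        lang: str = "hu") -> None:
    """Hypothetical helper: store the vectors in a blank spaCy pipeline on disk."""
    # Imported lazily so the parser above works without spaCy installed.
    import numpy
    import spacy

    nlp = spacy.blank(lang)
    dim = len(next(iter(vectors.values())))
    nlp.vocab.reset_vectors(width=dim)
    for word, values in vectors.items():
        nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))
    nlp.to_disk(output_dir)


# Example usage (the output directory name is an assumption):
# with gzip.open("cc.hu.300.vec.gz", "rt", encoding="utf-8") as f:
#     vectors = read_vec(f)
# save_as_spacy_model(vectors, "hu_vectors_model")
```

A directory saved this way should be loadable again with spacy.load("hu_vectors_model"). Note that this keeps every vector in memory as Python lists, so for the full file you would probably want to stream rows straight into the vocab instead.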

After you convert this, you can’t really load it by just specifying the language, like language: hu. (Maybe you can load it with language: "/tmp/la_vectors_wiki_lg". Maybe? I haven’t tried this yet.) You need to build the package and install the language separately. (I forgot how I did it.) Instead of going through all this trouble, you can subclass the SpacyNLP component and override the create / load methods to load your model.

Hope that helps.


Could you tell me how exactly to override the SpacyNLP component?

I see in the Components doc that it should go somewhat like this:

pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"

  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `false`. For some
  # applications and models it makes sense to differentiate
  # between these two words, therefore setting this to `true`.
  case_sensitive: false

But what is that model? Shouldn’t it have a format or something? The file I ended up with after converting from FastText to Spacy is called cc.hu.300.vec.gz. What folder should the file be in for the program to be able to load it? Where do I declare the language hu? Or is that not necessary? Should this be the first pipeline? Are several pipelines even allowed in the config.yml file? Or should this overwrite the pipeline that’s already in the file, which is:

language: hu
pipeline: supervised_embeddings

It would be nice to have more detailed docs… :weary:

@amn41 could you give me an answer to this please? Thanks.

I recognize that I’m late to the question here but FastText is now available directly in Rasa via the rasa-nlu-examples project that I maintain.

AskMeAnything[tm].
