How to train Rasa for other languages

Hi All,

I’m reading about the related subject here https://rasa.com/docs/rasa/nlu/language-support/

Is there any sample about this one?

Cheers

What language are you interested in? We may not have pre-trained word embeddings in every language but the base count vectorizer approach should work on any language that we can tokenize.

I’m trying to use fastText pre-trained embeddings… I see that it’s supported.

Is there any sample on how to use it? Thanks a lot.

There are two options for fasttext.

  • Option 1: Load fasttext into spaCy and then load spaCy into Rasa. You might find this guide helpful on how to link spaCy with Rasa.
  • Option 2: I’ve recently open sourced a new project called rasa_nlu_examples to make this process a whole lot easier. You can read the announcement here. It’s a side project that I maintain and the idea is that it is sort of a contrib-like project. We have two word embeddings available there that you can play with: fasttext and bytepair. The bytepair embeddings are available in 275 languages. More information on how to set up fasttext via this route can be found here, and you might also find the benchmarking guide useful.

If you end up using the 2nd option, feel free to let me know on GitHub if there are any bugs or features you’d like me to consider.
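For Option 2, a minimal pipeline sketch might look like this (the component path is the one documented in rasa_nlu_examples; the lang, vs, and dim values are just illustrative and should be swapped for your own language):

```yaml
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  # Dense subword embeddings from rasa_nlu_examples
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en      # BytePair language code
    vs: 10000     # vocabulary size
    dim: 100      # embedding dimension
  - name: DIETClassifier
    epochs: 100
```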


Nice… will check this now…

Thanks a lot for your help.

Cheers


Hi, currently trying Option 2. Looks like it supports my languages of choice.

Do you have any comparison of fasttext vs bytepair? What are the criteria for choosing between the two?

I see that the languages I need are covered by both…

Cheers

Oh, and is there any plan to merge the contrib project into core Rasa? I think it’s awesome stuff!

Cheers

I’ve found an issue while following the benchmarking guide… I’ve already posted it there.

Thanks a lot.

The idea behind the repository is that if it turns out that a feature is super useful then yes, we can move it into Rasa. But there’s a lot out there and Rasa needs to remain stable. That is why this repository was created: this way we can experiment a bit more and get feedback.

It should also make it easier for you to write custom components, because of the examples that are already there. Let me know on GitHub if there are features missing.


Thanks for your help on GitHub as well.

Currently I’m trying to follow your instructions there.

Will let you know the result asap.

Cheers

Hi @koaning, it works now! Not sure why. I used the bytepair featurizer.

  1. Created a clean conda environment
  2. Created a fresh project with rasa init --no-prompt
  3. Changed config.yml to the following
  4. Edited the config to use Indonesian (id)

```yaml
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: id
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en
    vs: 1000
    dim: 25
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
```

Basically I just modified the fresh project’s config and added this:

```yaml
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en
    vs: 1000
    dim: 25
```

I ran rasa train. It works well.

So what’s next? I think I should try changing data/nlu.md to an Indonesian-language dataset? Then run rasa train again?

Am I heading in the right direction?

Cheers
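For reference, Rasa 1.x training data in data/nlu.md uses the Markdown intent format, so switching languages just means swapping the example phrases. A tiny illustrative sketch (intent names and Indonesian phrases are made up here):

```md
## intent:greet
- halo
- selamat pagi
- apa kabar?

## intent:thanks
- terima kasih
- makasih banyak
```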

Hi @koaning, basically I should make these language codes the same, right? What’s the difference between “language: id” and “lang: id”, and how will they be used during training? Thanks

The id at the top indicates the language setting at the pipeline level, and in general I think it is supposed to be the same as the lang id in the bytepair settings. I don’t know, though, if the two-letter abbreviations in BytePair are the same as what Rasa uses. I don’t know if there’s an international standard for this, so it’s good to manually check the BytePair website.

Thanks, crystal clear.

Confirmed after checking the documentation regarding the language codes.

So the next step should be changing data/nlu.md to an Indonesian-language dataset? Then run rasa train again?

Am I heading in the right direction?

Just seen your reply on GitHub, @koaning; posting it here so everyone can see the end-to-end process if they need it. Thanks a lot. Marking this thread solved.

Hi!

Thank you for this clarification. I am currently working on making a pipeline for Estonian. I chose the first option and kept the suggested pipeline (here):

```yaml
# Configuration for Rasa NLU.
# Components
language: et_model

pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
```

I understand now that the CountVectorsFeaturizer is written twice on purpose. But going through the pipeline another question arose. In LexicalSyntacticFeaturizer one of the features it generates is

pos             Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).

I can see that the only benefit of fastText wrapped into spaCy is the word vectors. Loading a fastText model into spaCy gives me neither lemmatization nor POS tags.

Might this influence the performance somehow? I mean, if spaCy doesn’t give POS tags, then I expect the LexicalSyntacticFeaturizer just gives NaN for every POS value…

The LexicalSyntacticFeaturizer adds things like “does this word start with a capital letter”. This is very different from the POS features that spaCy generates.

The spaCy featurizer only adds the word vector features to my knowledge. It does not add POS information. I’d argue that it’s plausible though that POS features could make entity detection easier down the line. That is why there’s an open ticket on the Rasa NLU examples repository.

Just to be clear, the LexicalSyntacticFeaturizer is not related to spaCy in our implementation.
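If you want to control which of these sliding-window features are used, the featurizer accepts a features list in config.yml. A sketch based on the defaults in the Rasa docs (the pos/pos2 features only get filled in when the SpacyTokenizer runs first):

```yaml
  - name: LexicalSyntacticFeaturizer
    features:
      - ["low", "title", "upper"]                          # previous token
      - ["BOS", "EOS", "low", "upper", "title", "digit"]   # current token
      - ["low", "title", "upper"]                          # next token
```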

Thank you for your quick answer, @koaning! In that case, are the docs a bit wrong? Why else would they say that SpacyTokenizer (I am not talking about the spaCy featurizer) is needed for POS values? When I look at the code in lexical_syntactic_featurizer.py, it imports from the spacy_tokenizer module:

```python
# lexical_syntactic_featurizer.py, line 8
from rasa.nlu.tokenizers.spacy_tokenizer import POS_TAG_KEY
```

and it also takes POS tags from the token:

```python
# lexical_syntactic_featurizer.py, lines 65-70
"pos": lambda token: token.data.get(POS_TAG_KEY)
if POS_TAG_KEY in token.data
else None,
"pos2": lambda token: token.data.get(POS_TAG_KEY)[:2]
if "pos" in token.data
else None,
```

In spacy_tokenizer.py, line 40 adds tag_ (the detailed part-of-speech tag, according to the spaCy docs) to the token.

Therefore it seems to me that the LexicalSyntacticFeaturizer is related to spaCy (and otherwise gives None for the pos and pos2 features), and if I want to get the maximum out of Rasa I have to consider that implementing fastText via spaCy lacks linguistic data (it gives only word vectors), so I should supply POS tags and lemmas via other resources. What do you think?
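As a quick illustration of this effect: a spaCy pipeline without a tagger (which is what you effectively get when only fastText vectors are loaded) leaves the fine-grained tag_ attribute empty, so the pos features would indeed fall back to None. A minimal sketch, assuming spaCy is installed:

```python
# Sketch: a blank spaCy pipeline has no tagger, similar to a spaCy model
# built only from fastText vectors, so token.tag_ stays empty.
import spacy

nlp = spacy.blank("en")           # tokenizer only, no POS tagger
doc = nlp("Rasa speaks Estonian")
tags = [token.tag_ for token in doc]
print(tags)                       # ['', '', ''] – no POS information
```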

@Lindafr I wasn’t aware of the pos item in the docs. Interesting.

I want to double-check this now. Will report back in a few minutes with an extensive answer. Odds are that you’re totally correct and you’ve caught me on something here! (Well done!)