How to train Rasa for other languages

Hi All,

I’m reading about the related subject here https://rasa.com/docs/rasa/nlu/language-support/

Is there any sample about this one?

Cheers

What language are you interested in? We may not have pre-trained word embeddings in every language but the base count vectorizer approach should work on any language that we can tokenize.

I’m trying to use fastText pre-trained embeddings… I see that it’s supported.

Is there any sample on how to use it? Thanks a lot.

There are two options for fasttext.

  • Option 1: Load fasttext into spaCy and then load spaCy into Rasa. You might find this guide helpful on how to link spaCy with Rasa.
  • Option 2: I’ve recently open sourced a new project called rasa_nlu_examples to make this process a whole lot easier. You can read the announcement here. It’s a side project that I maintain and the idea is that it is sort of a contrib-like project. We have two word embeddings available there that you can play with: fasttext and bytepair. The bytepair embeddings are available in 275 languages. More information on how to set up fasttext via this route can be found here, and you might also find the benchmarking guide useful.

If you end up using the 2nd option, feel free to let me know on GitHub if there are any bugs or features you’d like me to consider.
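For Option 2, a minimal pipeline sketch might look like this (the component path is the one documented in rasa_nlu_examples; the lang, vs, and dim values are just illustrative and should be swapped for your own language):

```yaml
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  # Dense subword embeddings from rasa_nlu_examples
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en      # BytePair language code
    vs: 10000     # vocabulary size
    dim: 100      # embedding dimension
  - name: DIETClassifier
    epochs: 100
```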


Nice… will check this now…

Thanks a lot for your help.

Cheers


Hi, currently trying Option 2. Looks like it supports my languages of choice.

Do you have any comparison of fasttext vs bytepair? What are the criteria for choosing between the two?

I see that the languages I need are covered by both…

Cheers

Oh, and is there any plan to merge the contrib project into core Rasa? I think it’s awesome stuff!

Cheers

I’ve found an issue while following the benchmarking guide… I’ve already posted it there.

Thanks a lot.

The idea behind the repository is that if it turns out that a feature is super useful then yes, we can move it into Rasa. But there’s a lot out there and Rasa needs to remain stable. That is why this repository was created: this way we can experiment a bit more and get feedback.

It should also make it easier for you to write custom components, because of the examples that are already there. Let me know on GitHub if there are features missing.


Thanks for your help on GitHub as well.

Currently I’m trying to follow your instructions there.

Will let you know the result asap.

Cheers

Hi @koaning, it works now! Not sure why. I used the bytepair featurizer.

  1. Created a clean conda environment
  2. Created a fresh project with rasa init --no-prompt
  3. Changed config.yml to the following
  4. Edited the config to use Indonesian (id)

```yaml
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: id
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en
    vs: 1000
    dim: 25
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
```

Basically I just modified the fresh project’s config and added this:

```yaml
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en
    vs: 1000
    dim: 25
```

I ran rasa train. It works well.

So what’s next? I think I should try changing data/nlu.md to an Indonesian-language dataset? Then run rasa train again?

Am I heading in the right direction?

Cheers
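For reference, Rasa 1.x training data in data/nlu.md uses the Markdown intent format, so switching languages just means swapping the example phrases. A tiny illustrative sketch (intent names and Indonesian phrases are made up here):

```md
## intent:greet
- halo
- selamat pagi
- apa kabar?

## intent:thanks
- terima kasih
- makasih banyak
```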

Hi @koaning, basically I should make these language codes the same, right? What’s the difference between “language: id” and “lang: id”, and how will they be used during training? Thanks

The id at the top indicates the language setting at the pipeline level, and in general I think it is supposed to be the same as the lang id in the bytepair settings. I don’t know, though, if the two-letter abbreviations in BytePair are the same as what Rasa uses. I don’t know if there’s an international standard for this, so it’s good to manually check the BytePair website.

Thanks, crystal clear.

Confirmed after checking the documentation regarding the language codes.

So the next step should be changing data/nlu.md to an Indonesian-language dataset? Then run rasa train again?

Am I heading in the right direction?

Just seen your reply on GitHub, @koaning; posting it here so everyone can see the end-to-end process if they need it. Thanks a lot. Marking this thread solved.

Hi!

Thank you for this clarification. I am currently working on making a pipeline for Estonian. I chose the first option and kept the suggested pipeline (here):

```yaml
# Configuration for Rasa NLU.
# Components
language: et_model

pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
```

I understand now that the CountVectorsFeaturizer is written twice on purpose. But going through the pipeline another question arose. In LexicalSyntacticFeaturizer one of the features it generates is

pos             Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).

I can see that the only benefit of fastText wrapped into spaCy is the word vectors. Loading a fastText model into spaCy gives me neither lemmatization nor POS tags.

Might this influence the performance somehow? I mean, if spaCy doesn’t give POS tags, then I expect the LexicalSyntacticFeaturizer just gives NaN for every POS value…

The LexicalSyntacticFeaturizer adds things like “does this word start with a capital letter”. This is very different from the POS features that spaCy generates.

The spaCy featurizer only adds the word vector features to my knowledge. It does not add POS information. I’d argue that it’s plausible though that POS features could make entity detection easier down the line. That is why there’s an open ticket on the Rasa NLU examples repository.

Just to be clear, the LexicalSyntacticFeaturizer is not related to spaCy in our implementation.
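If you want to control which of these sliding-window features are used, the featurizer accepts a features list in config.yml. A sketch based on the defaults in the Rasa docs (the pos/pos2 features only get filled in when the SpacyTokenizer runs first):

```yaml
  - name: LexicalSyntacticFeaturizer
    features:
      - ["low", "title", "upper"]                          # previous token
      - ["BOS", "EOS", "low", "upper", "title", "digit"]   # current token
      - ["low", "title", "upper"]                          # next token
```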

Thank you for your quick answer, @koaning! In that case, are the docs a bit wrong? Why else would they say that SpacyTokenizer (I am not talking about the spaCy featurizer) is needed for POS values? When I look at the code in lexical_syntactic_featurizer.py, it imports from the spacy_tokenizer module:

```python
# lexical_syntactic_featurizer.py, line 8
from rasa.nlu.tokenizers.spacy_tokenizer import POS_TAG_KEY
```

and it also takes POS tags from the token:

```python
# lexical_syntactic_featurizer.py, lines 65-70
"pos": lambda token: token.data.get(POS_TAG_KEY)
if POS_TAG_KEY in token.data
else None,
"pos2": lambda token: token.data.get(POS_TAG_KEY)[:2]
if "pos" in token.data
else None,
```

In spacy_tokenizer.py, line 40 adds tag_ (the detailed part-of-speech tag, according to the spaCy docs) to the token.

Therefore it seems to me that the LexicalSyntacticFeaturizer is related to spaCy (and otherwise gives None for the pos and pos2 features), and if I want to get the maximum out of Rasa I have to consider that implementing fastText via spaCy lacks linguistic data (it gives only word vectors), so I should supply POS tags and lemmas via other resources. What do you think?
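As a quick illustration of this effect: a spaCy pipeline without a tagger (which is what you effectively get when only fastText vectors are loaded) leaves the fine-grained tag_ attribute empty, so the pos features would indeed fall back to None. A minimal sketch, assuming spaCy is installed:

```python
# Sketch: a blank spaCy pipeline has no tagger, similar to a spaCy model
# built only from fastText vectors, so token.tag_ stays empty.
import spacy

nlp = spacy.blank("en")           # tokenizer only, no POS tagger
doc = nlp("Rasa speaks Estonian")
tags = [token.tag_ for token in doc]
print(tags)                       # ['', '', ''] – no POS information
```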

@Lindafr I wasn’t aware of the pos item in the docs. Interesting.

I want to double-check this now. Will report back in a few minutes with an extensive answer. Odds are that you’re totally correct and you’ve caught me on something here! (Well done!)