How to train Rasa for another language

Yep. I totally learned something today. Thanks @Lindafr!

So here’s what I’ve been able to confirm: yes, you can add spaCy POS features via the LexicalSyntacticFeaturizer. This was unknown to me before, but I can confirm that it does indeed add features. My previous answer was wrong.

Over on the rasa nlu examples project I’ve made a component called rasa_nlu_examples.meta.Printer. This component lets you print extra information from the component pipeline. Here are two examples of pipelines that use it.
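To illustrate the idea: a diagnostic component like this just passes the message through unchanged while printing what it contains. This is NOT the actual rasa_nlu_examples.meta.Printer code, just a toy sketch using a plain dict in place of a real Rasa message object:

```python
# Toy sketch of a "printer" diagnostic component.
# NOTE: not the real rasa_nlu_examples.meta.Printer implementation;
# a real Rasa component implements the Component API and receives
# a Message object, not a dict.

class Printer:
    """Prints the message contents and passes the message through unchanged."""

    def __init__(self, alias="printer"):
        self.alias = alias  # label so you can tell multiple printers apart

    def process(self, message):
        print(self.alias)
        for key, value in message.items():
            print(f"{key}: {value}")
        return message  # downstream components see the unmodified message


msg = {"text": "yo", "intent": {"name": None, "confidence": 0.0}, "entities": []}
Printer(alias="printer before").process(msg)
```

Placing one such component before and one after a featurizer, as in the pipelines below, shows exactly which attributes that featurizer added.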

Example 1: No POS

language: en

pipeline:
- name: SpacyNLP
  model: "en_core_web_sm"
- name: SpacyTokenizer
- name: rasa_nlu_examples.meta.Printer
  alias: printer before
- name: LexicalSyntacticFeaturizer
  features:
    - ["low", "title", "upper"]
    - ["BOS", "EOS", "low", "upper", "title", "digit"]
    - ["low", "title", "upper"]
- name: rasa_nlu_examples.meta.Printer
  alias: printer after
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

Example 2: With POS

language: en

pipeline:
- name: SpacyNLP
  model: "en_core_web_sm"
- name: SpacyTokenizer
- name: rasa_nlu_examples.meta.Printer
  alias: printer before
- name: LexicalSyntacticFeaturizer
  features:
    - ["low", "title", "upper", "pos"]
    - ["BOS", "EOS", "low", "upper", "title", "digit", "pos"]
    - ["low", "title", "upper", "pos"]
- name: rasa_nlu_examples.meta.Printer
  alias: printer after
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

Results

When you train these pipelines, you can see that the POS setting indeed adds features to the pipeline.

Results without POS

printer before
text : yo
intent: {'name': None, 'confidence': 0.0}
entities: []
text_spacy_doc: yo
tokens: ['yo', '__CLS__']


printer after
text : yo
intent: {'name': None, 'confidence': 0.0}
entities: []
text_spacy_doc: yo
tokens: ['yo', '__CLS__']
text_sparse_features: <2x18 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in COOrdinate format>

Results with POS

printer before
text : yo
intent: {'name': None, 'confidence': 0.0}
entities: []
text_spacy_doc: yo
tokens: ['yo', '__CLS__']


printer after
text : yo
intent: {'name': None, 'confidence': 0.0}
entities: []
text_spacy_doc: yo
tokens: ['yo', '__CLS__']
text_sparse_features: <2x122 sparse matrix of type '<class 'numpy.float64'>'
        with 14 stored elements in COOrdinate format>

You can see that the lexical features with pos really do add features. The sparse feature matrix is now much larger in shape, but we’ve only added 1 non-zero component for each token (yo and __CLS__; the __CLS__ token represents the entire sentence). This implies that the POS features are added as sparse features to the pipeline.
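You can mimic what’s happening with a small scipy sketch. The numbers below are chosen to match the output above (18 lexical columns, 104 hypothetical POS columns; the real column count depends on spaCy’s tag set), but the mechanism is the same: one-hot POS columns make the matrix much wider while adding only one stored element per token:

```python
import numpy as np
from scipy import sparse

# Two "tokens" (yo and __CLS__) with 18 lexical feature columns
# and 12 stored elements, matching the printer output above.
rows = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
cols = [0, 2, 5, 7, 9, 11, 1, 3, 6, 8, 10, 12]
lexical = sparse.coo_matrix((np.ones(12), (rows, cols)), shape=(2, 18))

# Hypothetical one-hot POS features: 104 possible tags, and each token
# activates exactly one column (tag indices here are made up).
pos = sparse.coo_matrix(
    (np.ones(2), ([0, 1], [4, 37])),
    shape=(2, 104),
)

# Stacking the two feature blocks side by side widens the matrix a lot
# but only adds 2 stored elements (one per token).
combined = sparse.hstack([lexical, pos]).tocoo()
print(combined.shape)  # (2, 122)
print(combined.nnz)    # 14
```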

The question that remains is what is happening with FastText and spaCy in your use case. My pipeline was English and I was not using FastText vectors inside of spaCy. I’m not 100% sure how spaCy deals with FastText vectors with regard to POS tagging.

Could you try running the same pipeline using the FastText component found in rasa-nlu-examples, while also using a base spaCy model that does not have FastText added to it? You can use the printer from the examples project to get a glimpse of which features get added during which phase of the pipeline.
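For reference, wiring that dense FastText featurizer into a pipeline looks roughly like this (parameter names are as I recall them from the rasa-nlu-examples docs, so double-check there; the Estonian vectors file name is the standard fastText download):

```yaml
pipeline:
- name: WhitespaceTokenizer
- name: rasa_nlu_examples.featurizers.dense.FastTextFeaturizer
  cache_dir: path/to/vectors   # folder holding the downloaded .bin file
  file: cc.et.300.bin          # pre-trained Estonian fastText vectors
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
```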

Another option worth mentioning, if you feel FastText is perhaps a bit too heavy: we also support BytePair embeddings, which are a lot more lightweight. There are a lot of options to pick from, as shown here.
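The BytePair featurizer from the same project would slot in similarly (again, the parameter names are from memory of the project docs, so verify them there):

```yaml
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: et     # language of the pre-trained subword embeddings
  vs: 10000    # vocabulary size of the BPE model
  dim: 100     # embedding dimension
```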

No problem, I’m just really interested in the “tummy of the Rasa”.

I investigated all the possibilities for adding an Estonian model from the docs. It seemed to me that FastText was the only option (I didn’t know about the BytePair embeddings). Heaviness is currently not a problem, and I’ll continue using FastText. But I’ll keep the other option in mind (thank you!) for testing all possible Rasa configurations later.

Regarding my problem of Estonian FastText in spaCy: I chose this because we currently do not have a spaCy model for Estonian (well, we used to have one; a master’s student made it for his thesis, but it got lost). This at least gives me word vectors (and I like it a lot!). As my testing shows, adding FastText to spaCy does not miraculously create lemmatization and POS tagging for that spaCy model. I am currently thinking of using stanza lemmatization and POS tagging somewhere in the pipeline, but I don’t want to change much of the code (more changes mean more chances of errors and mistakes).

I shall try to do the experiments you suggested on Friday. What do you mean by base spaCy? For Estonian there is none.


Ah, I assumed there was a pre-trained spaCy model for Estonian. As far as I know, FastText won’t cause spaCy to suddenly understand POS.

Another thing, just to let you know: I’m currently adding gensim support to the rasa-nlu-examples repo. This should allow you to train your own embeddings for Rasa more easily.

If there are other items missing (maybe there are other POS-tagging libraries out there), feel free to add a ticket to that repo.

Come to think of it: @Lindafr, are you interested in POS with the idea of doing entity detection, or just POS in general? I was thinking about it and wondered … most entities would be NOUNs anyway? Maybe a system that just detects NOUNs would suffice.
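That NOUN idea can be illustrated with a tiny sketch. In practice the (token, tag) pairs would come from a real tagger like spaCy or stanza; here they are hand-written, and the function name is made up for illustration:

```python
# Toy sketch: treat nouns and proper nouns as entity candidates.
# The (token, tag) pairs use Universal POS tags (NOUN, PROPN, VERB, ...).

def noun_candidates(tagged_tokens):
    """Return tokens tagged as nouns; a crude stand-in for entity detection."""
    return [tok for tok, tag in tagged_tokens if tag in {"NOUN", "PROPN"}]


tagged = [
    ("book", "VERB"),
    ("a", "DET"),
    ("flight", "NOUN"),
    ("to", "ADP"),
    ("Tallinn", "PROPN"),
]
print(noun_candidates(tagged))  # ['flight', 'Tallinn']
```

Of course this over-generates ("flight" may not be an entity in every domain), which is why POS tags are usually a feature for a trained entity extractor rather than the detector itself.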

Hi, @koaning! I added an idea under the issues in the rasa_nlu_examples repo, although I guess it’s worth considering in the rasa repo as well. I am interested in POS in general. I figured that if it is used, it’s probably useful, and I would benefit from having it too. For sure I want to add lemmatization for Estonian, since it is a highly agglutinative language.

I think you are right about the guess that most entities would be NOUNs anyway. And as I see it, usually the “pos” option is left out and only

  features:
    - ["low", "title", "upper"]
    - ["BOS", "EOS", "low", "upper", "title", "digit"]
    - ["low", "title", "upper"]

are used. (It is like that even in the source code, I think?) Maybe it’s not that important then, and I can leave it out.

The reason it is left out of the defaults is probably that it only works if you’ve got a spaCy component in your pipeline (which a lot of pipelines/languages don’t have).

I’ve added a comment to the GitHub repo. I like your proposal :slight_smile:.

Hi @Lindafr, do you have any results from your experiment? Any luck?

Hi,

Just want to give some clarification so everyone gets the benefit of this thread. I really do think the content of this forum thread is super useful. A couple of things that I noticed:

I think we need to update the Language Support docs, as they have been made obsolete by Choosing a Pipeline. The latter states that supervised_embeddings is deprecated.

So basically, the content of Choosing a Pipeline is the way to go if we want to change the language.

That’s why @koaning suggested those 2 options. I also think it would be really cool to clearly separate both options in those docs with a bold title or something similar, so it’s easy to give a direct link. The if statement is a very good separation of concerns there.

1. Pre-trained word embeddings approach using SpacyFeaturizer (Components)

A more detailed explanation can be found in this blog post. Everything @Lindafr commented can also be useful for this approach.

2. Not using pre-trained word embeddings. There are some hints in these docs; here’s the snippet.

Then follow the guide from @koaning in this blog post. You will find there are 3 featurizers you can use there, along with samples and GitHub links on that page.

I hope that helps. I’m not sure whether this should be in the core docs or the tutorial section. I’ve already read all the docs on your pages, and they’re very awesome for a newbie like me. :slight_smile:

I can create a tutorial on this and post it here to summarize the journey after I’m done.

Rasa rocks!

Hi @welly87, I haven’t conducted the experiments yet, because koaning suggested them assuming we already had a spaCy model for Estonian.


I think the BytePair or FastText embeddings might still contribute to better entity/intent scores. That said, yes, no POS features yet. There’s a ticket on GitHub now that I might be able to work on in the next two weeks.


I think so too.