Hi All,
I'm reading about the related subject here: https://rasa.com/docs/rasa/nlu/language-support/
Is there any sample for this?
Cheers
What language are you interested in? We may not have pre-trained word embeddings in every language but the base count vectorizer approach should work on any language that we can tokenize.
I'm trying to use the fastText pre-trained embeddings… I see that they're supported.
Is there any sample on how to use them? Thanks a lot.
There are two options for fastText. The first is to wrap the fastText vectors into a spaCy model and use the spaCy pipeline. The second: we started the rasa_nlu_examples project recently to make this process a whole lot easier. You can read the announcement here. It's a side project that I maintain, and the idea is that it is sort of a contrib-like project. There are two word embeddings available from it that you can play with: fastText and BytePair. The BytePair embeddings are available in 275 languages. More information on how to set up fastText via this route can be found here, and you might also find the benchmarking guide useful. If you end up using the second option, feel free to let me know on GitHub if there are any bugs or features you'd like me to consider.
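To give a feel for why the BytePair embeddings can cover 275 languages: they embed any word, even one never seen in training, by splitting it into learned subword pieces and combining the piece vectors. Here is a toy sketch of that idea, using random stand-in vectors and made-up pieces (the real models ship pre-trained pieces per language):

```python
import random

random.seed(0)
DIM = 25  # vector dimension, like the `dim` setting in the featurizer
pieces = ["ha", "lo", "du", "nia"]  # stand-in subword vocabulary (`vs` in the featurizer)
vectors = {p: [random.gauss(0, 1) for _ in range(DIM)] for p in pieces}
vectors["<unk>"] = [0.0] * DIM

def segment(word):
    """Greedy longest-match segmentation into known subword pieces."""
    segs, i = [], 0
    while i < len(word):
        match = next((p for p in sorted(pieces, key=len, reverse=True)
                      if word.startswith(p, i)), None)
        segs.append(match or "<unk>")
        i += len(match) if match else 1
    return segs

def embed(word):
    """Average the piece vectors to get one vector for the whole word."""
    segs = segment(word)
    return [sum(vectors[s][d] for s in segs) / len(segs) for d in range(DIM)]

print(segment("halodunia"))     # ['ha', 'lo', 'du', 'nia']
print(len(embed("halodunia")))  # 25
```

Because every word decomposes into pieces, there is no out-of-vocabulary problem, which is what makes the approach cheap to ship for many languages.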
nice… will check this now…
thanks a lot for your help.
cheers
Hi, currently trying Option 2. It looks like it supports my languages of choice.
Do you have any comparison of fastText vs BytePair? What are the criteria for choosing between them?
I see that a couple of the languages I need are covered by both…
Cheers
Oh, and is there any plan to merge the contrib project into core Rasa? I think it's awesome stuff!
Cheers
I've found an issue while following the benchmarking guide… I've already posted it there.
thanks a lot
The idea behind the repository is that if it turns out that a feature is super useful then, yes!, we can move it into Rasa. But there’s a lot out there and Rasa needs to remain stable. That is why this repository was created. We can experiment just a bit more and get feedback this way.
It should also make it easier for you to write custom components because of the examples that are already there. Let me know if there’s features missing on github.
Thanks for your help on GitHub as well.
I'm currently following your instructions there.
Will let you know the result ASAP.
Cheers
Hi @koaning, it works now! Not sure why. I'm using the BytePair featurizer.
```yaml
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: id
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: en
    vs: 1000
    dim: 25
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
```
Basically I just took a fresh project's config and added this component:

```yaml
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25
```
I ran rasa train and it works well.
So what's next? I think I should change data/nlu.md to an Indonesian-language dataset, then run rasa train again?
Am I in the right direction?
Cheers
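For reference, a minimal data/nlu.md in the Markdown training-data format could look like the snippet below; the Indonesian intents and example phrases here are made up purely for illustration:

```md
## intent:greet
- halo
- selamat pagi

## intent:goodbye
- sampai jumpa
- selamat tinggal
```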
Hi @koaning, basically I should make these two language codes the same, right? What's the difference between `language: id` and `lang: id`, and how are they used during training? Thanks
The `language: id` at the top indicates the language setting at the pipeline level, and in general I think it is supposed to be the same as the `lang: id` in the BytePair settings. I don't know, though, whether the two-letter abbreviations that BytePair uses are the same ones Rasa uses. I'm not sure there's an international standard for them, so it's good to manually check the BytePair website.
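Since the two codes live in different parts of the config, a quick sanity check can catch a mismatch before training. A hedged sketch (the config is shown as a plain dict for brevity; in practice you would load config.yml with a YAML parser):

```python
# Consistency check: the pipeline-level `language` should match the
# BytePairFeaturizer's `lang` setting.
config = {
    "language": "id",
    "pipeline": [
        {"name": "WhitespaceTokenizer"},
        {"name": "rasa_nlu_examples.featurizers.dense.BytePairFeaturizer",
         "lang": "id", "vs": 1000, "dim": 25},
    ],
}

bp = next(c for c in config["pipeline"] if c["name"].endswith("BytePairFeaturizer"))
assert config["language"] == bp["lang"], "language codes differ!"
print("language codes match:", config["language"])
```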
Thanks, crystal clear.
Confirmed after checking the documentation regarding the language codes.
So the next step should be changing data/nlu.md into an Indonesian-language dataset, then running rasa train again?
Am I in the right direction?
Just saw your reply on GitHub, @koaning; posting it here so everyone can see the end-to-end process if they need it. Thanks a lot. Marking this thread solved.
Hi!
Thank you for this clarification. I am currently working on a pipeline for Estonian. I chose the first option and kept the suggested pipeline (here):
```yaml
# Configuration for Rasa NLU.
# Components
language: et_model
pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
```
I understand now that the CountVectorsFeaturizer is listed twice on purpose. But going through the pipeline, another question arose. One of the features LexicalSyntacticFeaturizer can generate is:

pos — take the part-of-speech tag of the token (`SpacyTokenizer` required).

As far as I can see, the only benefit of fastText wrapped into spaCy is the word vectors; loading a fastText model into spaCy gives me neither lemmatization nor POS tags.
Might this influence the performance somehow? I mean, if spaCy doesn't provide POS tags, then I expect the LexicalSyntacticFeaturizer to just give NaN for every POS value…
The LexicalSyntacticFeaturizer adds things like "does this word start with a capital letter?". This is very different from the POS features that spaCy generates.
The spaCy featurizer only adds the word-vector features, to my knowledge; it does not add POS information. I'd argue it's plausible, though, that POS features could make entity detection easier down the line. That is why there's an open ticket on the Rasa NLU examples repository.
Just to be clear, the LexicalSyntacticFeaturizer is not related to spaCy in our implementation.
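As an aside, the kind of surface features meant here can be sketched per token like this (the feature names below are illustrative, not Rasa's exact configuration keys):

```python
# Sketch of surface-level lexical features of the kind a lexical featurizer
# computes from the token text alone, with no linguistic model required.
def lexical_features(token: str) -> dict:
    return {
        "prefix2": token[:2],          # first two characters
        "suffix2": token[-2:],         # last two characters
        "is_title": token.istitle(),   # "does this word start with a capital letter"
        "is_digit": token.isdigit(),
        "is_lower": token.islower(),
    }

print(lexical_features("Tallinn"))
```

None of these require spaCy, which is why the component works with a plain whitespace tokenizer too.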
Thank you for your quick answer, @koaning! In that case, are the docs a bit wrong? Why else would they say that SpacyTokenizer (I am not talking about the spaCy featurizer) is needed for the pos values? When I look at the code in lexical_syntactic_featurizer.py, it imports from the SpacyTokenizer module:

```python
# line 8
from rasa.nlu.tokenizers.spacy_tokenizer import POS_TAG_KEY
```

and it also takes POS tags from the token:

```python
# lines 65-70
"pos": lambda token: token.data.get(POS_TAG_KEY)
if POS_TAG_KEY in token.data
else None,
"pos2": lambda token: token.data.get(POS_TAG_KEY)[:2]
if "pos" in token.data
else None,
```

In spacy_tokenizer.py, line 40 adds `tag_` (the detailed part-of-speech tag, according to the spaCy docs) to the token.
Therefore it seems to me that the LexicalSyntacticFeaturizer is related to spaCy (and gives None for the pos and pos2 features otherwise), and if I want to get the maximum out of Rasa, I have to account for the fact that fastText via spaCy lacks linguistic data (it gives only word vectors), so I should supply POS tags and lemmas from other resources. What do you think?
@Lindafr I wasn't aware of the pos item in the docs. Interesting.
I want to double-check this now. I'll report back in a few minutes with an extensive answer. Odds are you're correcting me on something here! (Well done!)