Hey Nik, any ideas on how to integrate two languages in a bot, so that if someone types in one language the bot is able to detect it and switch?
How about adding pipelines for local languages that are not supported in Rasa?
I have personally not implemented such a use case, so let us get help or suggestions from Chris. Pinging @ChrisRahme for help. Many thanks in advance.
Please see the following posts/thread:
Thank you @ChrisRahme
Do you have any implementation examples, maybe a moodbot with the advice you gave, so I can run it and have some more context?
There’s my chatbot, but it doesn’t use a custom component.
Instead, it asks the user which language they want to talk in at the start of the conversation. The bot will always understand 5 “languages” mentioned in the NLU, but will only respond in the language the user selected.
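To make the idea concrete, here is a minimal sketch in plain Python, not the actual code from the bot. The table `RESPONSES`, the function `respond`, and the response names are all hypothetical; in a real Rasa bot the chosen language would typically live in a slot and the lookup would happen in a custom action or in conditional responses.

```python
# Hypothetical response table: one entry per response name, one text per language.
RESPONSES = {
    "utter_greet": {
        "en": "Hello! How can I help you?",
        "fr": "Bonjour ! Comment puis-je vous aider ?",
    },
    "utter_goodbye": {
        "en": "Goodbye!",
        "fr": "Au revoir !",
    },
}

DEFAULT_LANGUAGE = "en"  # assumption: English is the fallback language


def respond(response_name: str, language: str) -> str:
    """Return the text of response_name in the language the user selected.

    Falls back to DEFAULT_LANGUAGE when no translation exists.
    """
    translations = RESPONSES[response_name]
    return translations.get(language, translations[DEFAULT_LANGUAGE])


# The language would be stored once, when the user picks it at the start.
print(respond("utter_greet", "fr"))
print(respond("utter_goodbye", "sw"))  # no "sw" entry, so the English text is used
```

The NLU side stays shared across languages; only the response lookup depends on the stored choice.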
@ChrisRahme Thank you very much, let me have a look at it and get back to you. Thank you.
Hey @ChrisRahme
I wanted to know more about how you chose the pipeline. I thought I would need to build custom word embeddings for the language I want to use. Or is it possible to work with the default pipeline, since the alphabet is like the English one, only missing a few letters?
Hello @atwine ,
Before you go down the rabbit hole of building custom word embeddings: which language are you building the bot for?
There are already a lot of pre-trained embeddings for low-resource languages available from spaCy and fastText, and some BERT variants too.
Also, the default self-supervised embeddings can work if you have a decent number of examples per intent (say about 15-20), as long as the language has words that can be split using the WhitespaceTokenizer; see the docs on how it splits tokens.
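A rough way to check both conditions before training is sketched below. It is only an illustration: `tokenize` uses plain `str.split()` as a stand-in and is not Rasa's WhitespaceTokenizer (which also handles punctuation), and `underpopulated_intents`, `MIN_EXAMPLES`, and the sample data are hypothetical.

```python
MIN_EXAMPLES = 15  # lower end of the "about 15-20 examples per intent" rule of thumb


def tokenize(text):
    """Split on runs of whitespace (a simplified stand-in, not Rasa's tokenizer)."""
    return text.split()


def underpopulated_intents(nlu_data, minimum=MIN_EXAMPLES):
    """Return the intents that have fewer than `minimum` training examples."""
    return sorted(
        intent for intent, examples in nlu_data.items() if len(examples) < minimum
    )


# Hypothetical training data: intent name -> list of example utterances.
nlu_data = {
    "greet": [f"hello there friend {i}" for i in range(16)],
    "goodbye": ["bye", "see you later"],
}

print(tokenize("good morning to you"))
print(underpopulated_intents(nlu_data))
```

If the tokens printed for a few sentences in the target language look like real words, whitespace splitting is probably good enough to start with.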
Hello @souvikg10
Thanks, the language I am trying to build for is Luganda (Ugandan local dialect). Ideally my bot should work for English and Luganda. Luganda mostly uses the English alphabet characters, and I think a whitespace tokenizer would do fine.
So you think I don’t have to try building custom embeddings?
You can try both:
A. Try the self-supervised embeddings first. See if that fits your needs; if it does, you don’t need anything else.
B. Enhance it with pre-trained embeddings in Luganda: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.lg.vec (DON'T CLICK ON THIS UNLESS YOU WANT TO DOWNLOAD THE VECTORS). You can follow this project on how to import these vectors into your Rasa project: FastTextFeaturizer - Rasa NLU Examples
All fastText pretrained vectors are here
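If you want to peek inside a `.vec` file before wiring it into Rasa, the text format is simple: a header line with the vocabulary size and the vector dimension, then one word per line followed by its values. The sketch below parses that format from an in-memory sample; `load_vec` is a hypothetical helper, not part of fastText or Rasa NLU Examples, and the sample values are made up.

```python
import io


def load_vec(handle, limit=None):
    """Parse fastText's text (.vec) format from an open text file handle.

    Returns (dimension, {word: [floats]}); `limit` caps the number of words read.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break
    return dim, vectors


# Made-up sample in the same layout; for a real file you would use
# open("wiki.lg.vec", encoding="utf-8") instead of io.StringIO.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample)
print(dim, sorted(vectors))
```

Passing a small `limit` is handy for a quick look, since the full files can be large.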
Thank you very much, this is a great place to start. I am beginning with part A. I have built a minimal bot that is able to work in English and Luganda; let me share it so you can have a look. covid.yml (714 Bytes) eng.yml (574 Bytes) nlu.yml (1.7 KB) rules.yml (413 Bytes) stories.yml (2.2 KB) config.yml (1.4 KB) domain.yml (2.9 KB)
This is the output:
Looks like it is working. Well done!! Some years back I worked on the Swahili language with the same pipeline, and my experience is that for most short task flows it works quite well.
Thanks @souvikg10
I have a question: if I use spaCy (it's the one I am using on my English bot with more than 100 intents), how will I combine it with the WhitespaceTokenizer? Will I just add it to the pipeline? Just wondering.
Does this pipeline make sense?
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
# No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# If you'd like to customize it, uncomment and adjust the pipeline.
# See https://rasa.com/docs/rasa/tuning-your-model for more information.
  - name: SpacyNLP
    model: en_core_web_md
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
    pooling: mean
  # - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# No configuration for policies was provided. The following default policies were used to train your model.
# If you'd like to customize them, uncomment and adjust the policies.
# See https://rasa.com/docs/rasa/policies for more information.
  - name: MemoizationPolicy
  - name: RulePolicy
  - name: UnexpecTEDIntentPolicy
    max_history: 5
    epochs: 100
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    constrain_similarities: true
You will need the same config for both languages if you are to follow @ChrisRahme’s steps. Right, Chris?
Nice job on your bot, Atwine. And thanks for the help, Souvik.
My bot used a single pipeline for all languages, and all the NLUs were mixed together. Your bot is already more advanced, since it can detect the language on its own. Mine can’t do that, so I couldn’t even switch configs if I wanted to.
Pretty sure you can use spaCy with the WhitespaceTokenizer, but I think it would be better to put it before any featurizers.
Thanks team, let me take this direction for now. However, I wonder if it will hold when the number of intents grows, since I will now have to make two of each.
Ah so you went with making an intent per language.
This solution works smoothly but indeed the number of intents, stories, rules, and responses grows by N whenever you add a new language.
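One way to keep that growth manageable is a strict naming convention, so code such as a custom action can recover the shared base intent and the language from the predicted intent name. The sketch below assumes a hypothetical suffix scheme (`greet_en`, `greet_lg`); the names `LANGUAGES` and `split_intent` are illustrative and not something Rasa provides.

```python
LANGUAGES = ("en", "lg")  # hypothetical language suffixes used in intent names


def split_intent(name, default_language="en"):
    """Split 'greet_lg' into ('greet', 'lg').

    Names without a known language suffix are returned unchanged,
    paired with default_language.
    """
    base, separator, suffix = name.rpartition("_")
    if separator and suffix in LANGUAGES:
        return base, suffix
    return name, default_language


for intent in ("greet_lg", "ask_symptoms_en", "ask_symptoms", "goodbye"):
    print(intent, "->", split_intent(intent))
```

The per-language NLU examples are still needed, but logic keyed on the base intent only has to be written once.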
@ChrisRahme, I’m trying to build a multilingual tourism bot. Initially I created it for English; how do I implement other languages? As you said, the number of intents, stories, rules, and responses grows by N. How do I deal with that? I also want to give real-time information for tourism. Tourism covers many places, so how do I handle the data (16 intents per location) for each and every location? It will be a huge amount of data. And how do I manage the responses for the intents?