Building a bot for local language

@nik202

Hey Nik, any ideas on how to integrate two languages in a bot, so that if someone types one language the bot is able to detect it and switch?

How about adding pipelines for local languages that are not supported in Rasa?

I have not personally implemented such a use case, so let us ask Chris for help or suggestions. Pinging @ChrisRahme. Many thanks in advance.

Please see the following posts/thread:

Thank you @ChrisRahme

Do you have any implementation examples, maybe a moodbot built with the advice you gave, so I can run it and get some more context?

There’s my chatbot, but it doesn’t use a custom component.

Instead, it asks the user which language they want to talk in at the start of the conversation. The bot will always understand 5 “languages” mentioned in the NLU, but will only respond in the language the user selected.
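
A minimal sketch of that idea in plain Python, assuming the selected language is kept somewhere like a slot. The names `RESPONSES`, `DEFAULT_LANGUAGE`, and `pick_response`, as well as the language codes and texts, are hypothetical and not taken from the actual bot:

```python
from typing import Dict, Optional

# Hypothetical response table: one text per response key and language code.
RESPONSES: Dict[str, Dict[str, str]] = {
    "utter_greet": {
        "en": "Hello! How can I help you?",
        "fr": "Bonjour ! Comment puis-je vous aider ?",
    },
}

# Assumed fallback when the user has not picked a language yet.
DEFAULT_LANGUAGE = "en"


def pick_response(key: str, language: Optional[str]) -> str:
    """Return the text for `key` in the language the user selected.

    Falls back to DEFAULT_LANGUAGE when no language was selected or when
    the response has no translation for the selected language.
    """
    translations = RESPONSES[key]
    chosen = language or DEFAULT_LANGUAGE
    return translations.get(chosen, translations[DEFAULT_LANGUAGE])
```

In a real bot, something like this would probably live inside a custom action that reads the language slot, so understanding stays shared while only the reply depends on the selection.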

@ChrisRahme Thank you very much, let me have a look at it and get back to you. Thank you.

Hey @ChrisRahme

I wanted to know more about how you chose the pipeline. I thought I would need to build custom word embeddings for the language I want to use. Or is it possible to work with the default pipeline, since the alphabet is like English, only missing a few letters?

Hello @atwine ,

Before you go down the rabbit hole of building custom word embeddings: which language are you building the bot for?

There are already a lot of pre-trained embeddings for low-resource languages available from spaCy and fastText, and some BERT variants too.

Also, the default self-supervised embeddings can work if you have a decent number of examples per intent (say about 15-20), as long as the language has words that can be split using WhitespaceTokenizer. See the docs on how it splits tokens.
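
To make both conditions concrete, here is a rough sketch in plain Python. `whitespace_tokens` is only a simplified stand-in (as far as I know the real WhitespaceTokenizer also deals with punctuation, so check the docs for exact behaviour), and `intents_needing_examples` and `MIN_EXAMPLES` are hypothetical helpers for checking the 15-20 rule of thumb:

```python
from typing import Dict, List

# Lower end of the "15-20 examples per intent" rule of thumb.
MIN_EXAMPLES = 15


def whitespace_tokens(text: str) -> List[str]:
    """Split on runs of whitespace (simplified stand-in for a whitespace tokenizer)."""
    return text.split()


def intents_needing_examples(
    examples_by_intent: Dict[str, List[str]], minimum: int = MIN_EXAMPLES
) -> List[str]:
    """Return the intents that have fewer than `minimum` training examples, sorted by name."""
    return sorted(
        intent
        for intent, examples in examples_by_intent.items()
        if len(examples) < minimum
    )
```

If the tokens that come out of `whitespace_tokens` look like real words in your language, the default pipeline is worth trying before anything heavier.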

Hello @souvikg10

Thanks, the language I am trying to build for is Luganda (a Ugandan local dialect). Ideally my bot should work for English and Luganda. Luganda mostly uses the English alphabet characters, and I think a whitespace tokenizer would do fine.

So you think I don’t have to try to build custom embeddings?

You can try both

A. Try the self-supervised embeddings first. See if that fits your needs; if it does, you don’t need anything else.

B. Enhance it with pre-trained embeddings in Luganda: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.lg.vec (DON'T CLICK ON THIS UNLESS YOU WANT TO DOWNLOAD THE VECTORS). You can follow this project on how to import these vectors into your Rasa project: FastTextFeaturizer - Rasa NLU Examples

All fastText pretrained vectors are here
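
If you want to inspect the downloaded vectors before wiring them into Rasa, here is a small sketch that reads the fastText text format (a header line with word count and dimension, then one word and its values per line). This is not how the FastTextFeaturizer loads them; `load_vec`, `mean_vector`, and the `limit` parameter are hypothetical names for illustration only:

```python
from typing import Dict, List, TextIO


def load_vec(handle: TextIO, limit: int = 50000) -> Dict[str, List[float]]:
    """Read up to `limit` word vectors from a fastText text-format (.vec) stream."""
    header = handle.readline().split()
    dim = int(header[1])  # header line is "<word count> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines or tokens that contain spaces
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def mean_vector(
    tokens: List[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of the known tokens; all zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]
```

You would call it with something like `with open("wiki.lg.vec", encoding="utf-8") as f: vectors = load_vec(f)` (the local file name is an assumption) and then check how many of your training tokens are actually covered.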

@souvikg10

Thank you very much, this is a great place to start. I am beginning with part A. I have built a minimal bot that works in English and Luganda; let me share it so you can have a look. covid.yml (714 Bytes) eng.yml (574 Bytes) nlu.yml (1.7 KB) rules.yml (413 Bytes) stories.yml (2.2 KB) config.yml (1.4 KB) domain.yml (2.9 KB)

This is the output:

Looks like it is working. Well done!! Some years back I worked on the Swahili language with the same pipeline, and my experience is that for most short task flows it works quite well.

Thanks @souvikg10

I have a question: if I use spaCy (it’s the one I am using on my English bot with more than 100 intents), how will I combine it with the whitespace tokenizer? Will I just add it to the pipeline?

Does this pipeline make sense?

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en

pipeline:
# No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# If you'd like to customize it, uncomment and adjust the pipeline.
# See https://rasa.com/docs/rasa/tuning-your-model for more information.
  - name: SpacyNLP
    model: en_core_web_md
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
    pooling: mean
  # - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# No configuration for policies was provided. The following default policies were used to train your model.
# If you'd like to customize them, uncomment and adjust the policies.
# See https://rasa.com/docs/rasa/policies for more information.
  - name: MemoizationPolicy
  - name: RulePolicy
  - name: UnexpecTEDIntentPolicy
    max_history: 5
    epochs: 100
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    constrain_similarities: true

You will need the same config for both languages if you are to follow @ChrisRahme’s steps. Right, Chris?

Nice job for your bot, Atwine. And thanks for the help, Souvig :slight_smile:

My bot used a single pipeline for all languages, and all the NLUs were mixed together. Your bot is already more advanced since it can detect the language on its own :slight_smile: Mine can’t do that, so I couldn’t even switch configs if I wanted to.

Pretty sure you can use Spacy with the Whitespace Tokenizer, but I think it would be better to put it before any Featurizers.

Thanks team, let me take this direction for now. However, I wonder if it will hold when the number of intents grows, since I will now have to make two of each.

Ah so you went with making an intent per language.

This solution works smoothly but indeed the number of intents, stories, rules, and responses grows by N whenever you add a new language.
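
One way to keep that growth manageable, sketched here purely as an assumption and not something either bot in this thread does, is to name per-language intents with a language suffix (for example `greet_en` and `greet_lg`) and split the name back into a shared base intent plus a language code. `KNOWN_LANGUAGES` and `split_intent` are hypothetical names:

```python
from typing import Optional, Tuple

# Hypothetical language suffixes used in intent names such as "greet_lg".
KNOWN_LANGUAGES = {"en", "lg"}


def split_intent(name: str) -> Tuple[str, Optional[str]]:
    """Split "greet_lg" into ("greet", "lg").

    Intent names without a known language suffix are returned unchanged,
    with None as the language.
    """
    base, separator, suffix = name.rpartition("_")
    if base and separator and suffix in KNOWN_LANGUAGES:
        return base, suffix
    return name, None
```

With the base intent and the language separated like this, the reply logic only needs one branch per base intent plus a lookup by language, instead of one branch per intent and language pair. The NLU examples still have to be written per language.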

@ChrisRahme, I’m trying to build a multilingual tourism bot. Initially I created it for English. How do I implement other languages? As you said, the number of intents, stories, rules, and responses grows by N. How do I deal with that? I also want to give real-time tourism information. Tourism covers many places, so how do I handle the data (16 intents per location) for each and every location, since it will be a huge amount of data? And how do I manage the responses for the intents?