Multilingual Chatbot for Indian Languages

We are trying to build a multilingual NLU for Indian languages. The purpose of the NLU is to understand user messages for booking an LPG cylinder or requesting a mechanic visit. The twist is that users can send messages in any Indian language, so we are preparing a separate intent for each language. Below is one example: the LPG gas booking intents in different languages. I have also added an out_of_scope intent, where I put all the out-of-scope messages, in all languages, that could come from users, so that we can simply reject those messages.

NLU Content:

- intent: GASBOOKING_gu
  examples: |
    - હું ગેસ બુક કરવા માંગુ છું
    - મારે સિલિન્ડર બુક કરવું છે
    - મારે ગેસ સિલિન્ડર બુક કરવું છે
    - મારા માટે ગેસ સિલિન્ડર બુક કરો
    - કૃપા કરીને ગેસ સિલિન્ડર બુક કરો
    - કૃપા કરીને ગેસ બુક કરો
    - ગેસ સિલિન્ડર બુક કરો
    - શું તમે મારા માટે સિલિન્ડર બુક કરી શકો છો?
    - બુક ગેસ
- intent: GASBOOKING_mr
  examples: |
    - मला गॅस बुक करायचा आहे
    - मला सिलिंडर बुक करायचे आहे
    - मला गॅस सिलेंडर बुक करायचा आहे
    - माझ्यासाठी गॅस सिलिंडर बुक करा
    - कृपया माझ्यासाठी गॅस सिलिंडर बुक करा
    - कृपया गॅस बुक करा
    - कृपया गॅस सिलेंडर बुक करा
- intent: GASBOOKING_hi
  examples: |
    - मैं गैस बुक करना चाहता हूं
    - मैं सिलेंडर बुक करना चाहता हूं
    - मैं गैस सिलेंडर बुक करना चाहता हूं
    - मेरे लिए गैस सिलेंडर बुक करो
    - कृपया मेरे लिए एक गैस सिलेंडर बुक करें
    - कृपया गैस बुक करें
    - कृपया गैस सिलेंडर बुक करें
- intent: GASBOOKING_en
  examples: |
    - I want to book a gas
    - I want to book a cyl
    - I want to book a cylinder
    - I want to book a gas cylinder
    - Book  gas cylinder for me
    - Book  gas for me
    - Book a cylinder for me
    - Kindly Book a gas cylinder for me
- intent: GASBOOKING_kn
  examples: |
    - ನಾನು ಗ್ಯಾಸ್ ಬುಕ್ ಮಾಡಲು ಬಯಸುತ್ತೇನೆ
    - ನಾನು ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಲು ಬಯಸುತ್ತೇನೆ
    - ನಾನು ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಅನ್ನು ಬುಕ್ ಮಾಡಲು ಬಯಸುತ್ತೇನೆ
    - ನನಗೆ ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಿ
    - ನನಗೆ ಗ್ಯಾಸ್ ಬುಕ್ ಮಾಡಿ
    - ನನಗೆ ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಿ
    - ದಯವಿಟ್ಟು ನನಗೆ ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಿ
    - ದಯವಿಟ್ಟು ಗ್ಯಾಸ್ ಬುಕ್ ಮಾಡಿ
    - ದಯವಿಟ್ಟು ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಅನ್ನು ಬುಕ್ ಮಾಡಿ
- intent: out_of_scope
  examples: |
    - How are you
    - what are you doing
    - i need help
    - i want gas papers
    - i need to book gas papers
    - i am looking for my gas papers
    - book my ticket
    - ਮੇਰੀ ਟਿਕਟ ਬੁੱਕ ਕਰੋ
    - ਤੁਸੀ ਕਿਵੇਂ ਹੋ
    - Tusi kivem ho
    - ਤੁਹਾਡਾ ਨਾਮ ਕੀ ਹੈ
    - உங்கள் பெயர் என்ன
    - നിന്റെ പേരെന്താണ്
    - તું શું કરે છે
    - તમારું નામ શું છે
    - મારી પ્રિય મૂવી ડાર્ક છે
    - ನನಗೆ ಕ್ರಿಕೆಟ್ ನೋಡಲು ಇಷ್ಟ
    - ನಿನ್ನ ಹೆಸರೇನು
    - ನನಗೆ ನಿನ್ನ ಸಹಾಯ ಬೇಕು
    - எனக்கு உங்கள் உதவி தேவை

Domain File NLU

intents:
  - GASBOOKING_en
  - GASBOOKING_hi
  - GASBOOKING_mr
  - GASBOOKING_gu
  - GASBOOKING_kn
  - out_of_scope

I am using the pipeline below, as per the Rasa documentation. As you can see, I am not using any pre-trained model:

Config File

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.7

Please advise whether this is a correct approach for handling multilingual conversations. Also, how effective would the out_of_scope intent be in such cases, where we need to give many more examples for out_of_scope?
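
For reference, the effect of the FallbackClassifier threshold in the config above can be pictured roughly as follows. This is a minimal sketch and not Rasa's actual implementation: the function apply_fallback and the (intent, confidence) ranking format are hypothetical, and the confidence values are made up for illustration. Only the 0.7 threshold and the intent names come from the files above; nlu_fallback is the intent name the FallbackClassifier predicts.

```python
# Hypothetical illustration of a confidence threshold; not Rasa's own code.
FALLBACK_INTENT = "nlu_fallback"


def apply_fallback(ranking, threshold=0.7):
    """Return the top intent, or the fallback intent if confidence is too low.

    ranking: list of (intent_name, confidence) pairs, in any order.
    """
    if not ranking:
        return FALLBACK_INTENT
    top_intent, top_confidence = max(ranking, key=lambda pair: pair[1])
    return top_intent if top_confidence >= threshold else FALLBACK_INTENT


# Made-up confidences, only to show both branches.
confident = [("GASBOOKING_en", 0.93), ("out_of_scope", 0.05)]
uncertain = [("GASBOOKING_en", 0.55), ("out_of_scope", 0.40)]
print(apply_fallback(confident))
print(apply_fallback(uncertain))
```

A threshold like this only catches messages the classifier is unsure about; a prediction that is confidently wrong still passes through unchanged.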

Dear Ashutosh, we have a similar requirement in mind but are really not sure where to start. Looking forward to the Rasa community's support.

Interesting!

My name is Vincent and I’m trying to add more support for Non-English languages in Rasa. There are a few things that jump to mind but I’ll gladly hear it if I am missing something.

  • We support a language-agnostic variant of BERT. It’s a pretrained model from Google, and the appendix of the original paper suggests that English, Hindi, Marathi, Gujarati and Kannada are indeed supported. To use it, you’ll want to configure a LanguageModelFeaturizer with the rasa/LaBSE weights. Note that LaBSE stands for Language-agnostic BERT Sentence Embedding. A downside of this approach is that it is very “heavy”: there’s a lot of compute time involved.
  • I maintain a project over at rasa-nlu-examples which supports many pre-trained word vectors that might also help. The BytePair embeddings hosted there are available in 250+ languages and could offer a more lightweight way of adding pre-trained context to your pipeline. You can find more info in the docs.

For my understanding, though: it seems like you’re interested in making a single assistant that can handle many languages. So I wonder, what responses do you send, and in what language? Is there a reason why you’re not considering making multiple assistants, one for each language?

Hi Vincent,

You understood it correctly: we are trying to make a single assistant that can handle many languages (there are 22 major languages in India, written in 13 different scripts). We are considering this because our NLU is limited to a small, fixed number of questions (LPG booking, mechanic visit, etc.), so we hope to cover these questions in all languages with a unique intent for each. The response will then be given in the user’s language. Making a separate assistant for each language is not the goal, as there would be too many of them.

I am not sure whether I need to give a large number of examples in the out_of_scope intent. The problem I am currently facing is that my trained NLU model produces false positives for messages that should go into out_of_scope (e.g. if a user sends the message 'I like watching movies', the NLU interpretation is GASBOOKING_en). Once I add similar examples to out_of_scope, it identifies them correctly. I have taken an English example, but it is happening with all languages.

It’s actually working well, except for the false-positive cases.

I will surely consider your suggestions.

The simple truth behind out of scope detection is that it is, certainly to my understanding, an unsolved problem. I’ve written down some technical details on why in this forum post but you might also appreciate this algorithm whiteboard video on fallback detection for some extra details.

One thing you might consider doing is to have multiple types of out_of_scope. If you have a look at our rasa-demo you’ll notice that you can pre-define many types of out-of-scope that should be detected. In your case, you might be able to have out-of-scope classes for each language.

This is a path that’s reasonable, but I wouldn’t spend too much time on it immediately. The fact that there are many out of scope situations imaginable doesn’t mean that they actually occur. It’s still best to look at examples from actual users as a source of inspiration for out-of-scope categories.

Out of curiosity, when you send the response to the user, how do you determine the correct language/text to send back? Is this handled by a custom action?

OK, we send back two types of responses:

  1. When the intent is GASBOOKING_<language>, the last two characters of the intent name give the language code (e.g. for GASBOOKING_hi the response should go out in Hindi).
  2. When the intent is out_of_scope: in my case there is only one out_of_scope category for all languages, so I use the polyglot library for language detection and then send an appropriate message to the user in the detected language.
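
The first case can be sketched as below. This is only an illustration under assumptions: the helper split_intent and the SUPPORTED_LANGUAGES set are hypothetical names, and out_of_scope_gu is a hypothetical per-language intent of the kind suggested earlier in this thread. It splits on the last underscore instead of slicing the last two characters, so that an intent without a language suffix (such as the shared out_of_scope) is not misread.

```python
# Language codes used as intent suffixes in the NLU data above.
SUPPORTED_LANGUAGES = {"en", "hi", "mr", "gu", "kn"}


def split_intent(intent_name):
    """Split 'GASBOOKING_hi' into ('GASBOOKING', 'hi').

    Returns (intent_name, None) when there is no recognised language
    suffix, e.g. for the shared 'out_of_scope' intent.
    """
    base, sep, suffix = intent_name.rpartition("_")
    if sep and suffix in SUPPORTED_LANGUAGES:
        return base, suffix
    return intent_name, None


print(split_intent("GASBOOKING_hi"))
print(split_intent("out_of_scope"))
print(split_intent("out_of_scope_gu"))  # hypothetical per-language intent
```

When the second element is None, the language has to come from somewhere else, for example from a language detector as in the second case.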

As you suggested, making a separate out_of_scope intent for each language could be more effective, I guess.

Are you using a custom action that’s using polyglot to handle the responses?

Yes, I am using a custom action that determines the language (script) of the message.

Example:

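    from polyglot.detect import Detector

    # Gujarati sample message ("I want to book")
    msg_text = 'હું બુક કરવા માંગુ છું'
    detector = Detector(msg_text)
    # Most likely language of the message, as reported by polyglot
    print(detector.language)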
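
If only the script is needed, a rough standard-library alternative is sketched below. This is not polyglot and not what the custom action above does: dominant_script is a hypothetical helper that takes the first word of each character's Unicode name as a stand-in for its script. A script does not identify a language on its own (Devanagari, for instance, is shared by Hindi and Marathi), so this can only narrow down the candidates.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script-like name prefix among letters in text.

    Heuristic: the first word of a character's Unicode name (e.g.
    'GUJARATI LETTER HA' -> 'GUJARATI') is used as a proxy for its script.
    Returns None if the text contains no letters or combining marks.
    """
    counts = Counter()
    for ch in text:
        # Count letters plus combining marks (vowel signs, anusvara, ...).
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("હું બુક કરવા માંગુ છું"))
print(dominant_script("मला गॅस बुक करायचा आहे"))
print(dominant_script("Book gas for me"))
```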