We are trying to build a multilingual NLU for Indian languages. Its purpose is to understand user messages for booking an LPG cylinder or requesting a mechanic visit. The twist is that users can send messages in any Indian language, so we are preparing a separate intent for each language. Below is one example: the LPG gas booking intents in different languages. I have also added an out_of_scope intent where I put all the out-of-scope messages, in all languages, that could come from users, so that we can simply reject those messages.
NLU Content:
- intent: GASBOOKING_gu
  examples: |
    - હું ગેસ બુક કરવા માંગુ છું
    - મારે સિલિન્ડર બુક કરવું છે
    - મારે ગેસ સિલિન્ડર બુક કરવું છે
    - મારા માટે ગેસ સિલિન્ડર બુક કરો
    - કૃપા કરીને ગેસ સિલિન્ડર બુક કરો
    - કૃપા કરીને ગેસ બુક કરો
    - ગેસ સિલિન્ડર બુક કરો
    - શું તમે મારા માટે સિલિન્ડર બુક કરી શકો છો?
    - બુક ગેસ
- intent: GASBOOKING_mr
  examples: |
    - मला गॅस बुक करायचा आहे
    - मला सिलिंडर बुक करायचे आहे
    - मला गॅस सिलेंडर बुक करायचा आहे
    - माझ्यासाठी गॅस सिलिंडर बुक करा
    - कृपया माझ्यासाठी गॅस सिलिंडर बुक करा
    - कृपया गॅस बुक करा
    - कृपया गॅस सिलेंडर बुक करा
- intent: GASBOOKING_hi
  examples: |
    - मैं गैस बुक करना चाहता हूं
    - मैं सिलेंडर बुक करना चाहता हूं
    - मैं गैस सिलेंडर बुक करना चाहता हूं
    - मेरे लिए गैस सिलेंडर बुक करो
    - कृपया मेरे लिए एक गैस सिलेंडर बुक करें
    - कृपया गैस बुक करें
    - कृपया गैस सिलेंडर बुक करें
- intent: GASBOOKING_en
  examples: |
    - I want to book a gas
    - I want to book a cyl
    - I want to book a cylinder
    - I want to book a gas cylinder
    - Book gas cylinder for me
    - Book gas for me
    - Book a cylinder for me
    - Kindly Book a gas cylinder for me
- intent: GASBOOKING_kn
  examples: |
    - ನಾನು ಗ್ಯಾಸ್ ಬುಕ್ ಮಾಡಲು ಬಯಸುತ್ತೇನೆ
    - ನಾನು ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಲು ಬಯಸುತ್ತೇನೆ
    - ನಾನು ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಅನ್ನು ಬುಕ್ ಮಾಡಲು ಬಯಸುತ್ತೇನೆ
    - ನನಗೆ ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಿ
    - ನನಗೆ ಗ್ಯಾಸ್ ಬುಕ್ ಮಾಡಿ
    - ನನಗೆ ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಿ
    - ದಯವಿಟ್ಟು ನನಗೆ ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಬುಕ್ ಮಾಡಿ
    - ದಯವಿಟ್ಟು ಗ್ಯಾಸ್ ಬುಕ್ ಮಾಡಿ
    - ದಯವಿಟ್ಟು ಗ್ಯಾಸ್ ಸಿಲಿಂಡರ್ ಅನ್ನು ಬುಕ್ ಮಾಡಿ
- intent: out_of_scope
  examples: |
    - How are you
    - what are you doing
    - i need help
    - i want gas papers
    - i need to book gas papers
    - i am looking for my gas papers
    - book my ticket
    - ਮੇਰੀ ਟਿਕਟ ਬੁੱਕ ਕਰੋ
    - ਤੁਸੀ ਕਿਵੇਂ ਹੋ
    - Tusi kivem ho
    - ਤੁਹਾਡਾ ਨਾਮ ਕੀ ਹੈ
    - உங்கள் பெயர் என்ன
    - നിന്റെ പേരെന്താണ്
    - તું શું કરે છે
    - તમારું નામ શું છે
    - મારી પ્રિય મૂવી ડાર્ક છે
    - ನನಗೆ ಕ್ರಿಕೆಟ್ ನೋಡಲು ಇಷ್ಟ
    - ನಿನ್ನ ಹೆಸರೇನು
    - ನನಗೆ ನಿನ್ನ ಸಹಾಯ ಬೇಕು
    - எனக்கு உங்கள் உதவி தேவை
Please advise whether this is a correct approach for handling multilingual conversations. Also, how effective would the out_of_scope intent be in such a setup, where we would need to provide many more examples for it?
My name is Vincent and I’m trying to add more support for Non-English languages in Rasa. There are a few things that jump to mind but I’ll gladly hear it if I am missing something.
We support a language-agnostic variant of BERT. It's a pretrained model from Google, and the appendix of the original paper suggests that English, Hindi, Marathi, Gujarati and Kannada are indeed supported. To use it, you'll want to configure a LanguageModelFeaturizer with the rasa/LaBSE weights. Note that LaBSE stands for Language-agnostic BERT Sentence Embedding. A downside of this approach is that it is very "heavy": there's a lot of compute time involved.
I maintain a project over at rasa-nlu-examples which supports many pre-trained word vectors that might also help. The BytePair embeddings hosted there are available in 250+ languages and could offer a more lightweight way of adding pre-trained features to your pipeline. You can find more info in the docs.
For my own understanding, though: it seems like you're interested in making a single assistant that can handle many languages. So I wonder, what responses do you send? In what language? Is there a reason why you're not considering making multiple assistants, one for each language?
You understood it correctly: we are trying to make a single assistant that can handle many languages (there are 22 major languages in India, written in 13 different scripts). We are considering this because our NLU is limited to a very fixed set of requests (LPG booking, mechanic visit, etc.), so we are hoping that we can cover these requests in all languages with a unique intent for each. The response is then given in the user's language.
Making multiple assistants, one per language, is not the goal; as you can see, there would be many of them in that case.
I am not sure whether I need to give a large number of examples for the out_of_scope intent. The problem I am currently facing is that my trained NLU model misclassifies messages that should go into out_of_scope, producing false positives. For example, if a user sends 'I like watching movies', the NLU interprets it as GASBOOKING_en. Once I add similar examples to out_of_scope, it identifies them correctly. I have used English as the example, but this is happening with all languages.
It's working well, actually, except for the false-positive cases.
The simple truth behind out of scope detection is that it is, certainly to my understanding, an unsolved problem. I’ve written down some technical details on why in this forum post but you might also appreciate this algorithm whiteboard video on fallback detection for some extra details.
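To make the limitation concrete, here is a minimal sketch of threshold-based fallback in plain Python. It is not Rasa's own implementation; the function name `apply_fallback` and the default values are assumptions chosen for illustration, loosely mirroring the `threshold` and `ambiguity_threshold` options of Rasa's FallbackClassifier.

```python
from typing import Dict


def apply_fallback(
    confidences: Dict[str, float],
    threshold: float = 0.7,            # assumed value, tune on real data
    ambiguity_threshold: float = 0.1,  # assumed value, tune on real data
    fallback_intent: str = "out_of_scope",
) -> str:
    """Return the top intent, or the fallback intent when the prediction
    is either not confident enough or too close to the runner-up."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return fallback_intent

    top_intent, top_conf = ranked[0]
    if top_conf < threshold:
        return fallback_intent

    # Reject when the two best intents are nearly tied.
    if len(ranked) > 1 and top_conf - ranked[1][1] < ambiguity_threshold:
        return fallback_intent

    return top_intent
```

A check like this only catches uncertain predictions. A confidently wrong prediction, such as 'I like watching movies' being scored highly as GASBOOKING_en, passes straight through, which is why thresholds alone do not solve out-of-scope detection.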
One thing you might consider doing is to have multiple types of out_of_scope. If you have a look at our rasa-demo you’ll notice that you can pre-define many types of out-of-scope that should be detected. In your case, you might be able to have out-of-scope classes for each language.
This is a path that’s reasonable, but I wouldn’t spend too much time on it immediately. The fact that there are many out of scope situations imaginable doesn’t mean that they actually occur. It’s still best to look at examples from actual users as a source of inspiration for out-of-scope categories.
Out of curiosity, when you send the response to the user, how do you determine the correct language/text to send back? Is this handled by a custom action?
When the intent is GASBOOKING_<LANGUAGE>, the last two characters of the intent name give us the language code. For example, for GASBOOKING_hi the response should go out in Hindi.
When the intent is out_of_scope: in my case there is only one out_of_scope category for all languages, so I am using the polyglot library for language detection and then sending an appropriate message to the user in the detected language.
As you suggested, making an out_of_scope intent for each language could be more effective, I guess.
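The language selection described above could look roughly like the following sketch. The function names, the `SUPPORTED_LANGUAGES` set, and the English default are assumptions for illustration; `detect_language` is a hypothetical callable standing in for whatever detector is used (polyglot in this thread), so no specific library API is implied.

```python
from typing import Callable, Optional

# Language codes taken from the intent names shown in the first post.
SUPPORTED_LANGUAGES = {"gu", "mr", "hi", "en", "kn"}


def language_from_intent(intent_name: str) -> Optional[str]:
    """Return the language suffix of an intent such as GASBOOKING_hi,
    or None when the name carries no known language suffix."""
    base, sep, suffix = intent_name.rpartition("_")
    if sep and base and suffix in SUPPORTED_LANGUAGES:
        return suffix
    return None


def response_language(
    intent_name: str,
    text: str,
    detect_language: Callable[[str], str],  # hypothetical detector hook
    default: str = "en",                    # assumed default reply language
) -> str:
    """Pick the reply language: intent suffix first, detector second."""
    lang = language_from_intent(intent_name)
    if lang is not None:
        return lang
    detected = detect_language(text)
    return detected if detected in SUPPORTED_LANGUAGES else default
```

With per-language intents such as a hypothetical out_of_scope_gu, the suffix alone would decide the language and the detector would only be needed for the single shared out_of_scope intent.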
One way I would take on a multilingual bot is to create a custom component that takes the user input and uses an LLM, such as OpenAI's or an open-source model, to convert it into English. That way you won't have to train the model on every language.
OpenAI is pretty good at detecting many languages. It can fail on some edge cases, but it would work in most cases.
A second way would be to make it button-based for major actions, for example offering options to choose which service to use.
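As a rough sketch of that translate-first flow, the function below keeps the translation and classification steps as injected callables. All three callables (`detect_language`, `translate_to_english`, `classify_english`) are hypothetical placeholders; they are not a real Rasa or OpenAI API and would need to be backed by an actual service.

```python
from typing import Callable, Tuple


def classify_via_english(
    text: str,
    detect_language: Callable[[str], str],            # hypothetical hook
    translate_to_english: Callable[[str, str], str],  # hypothetical hook
    classify_english: Callable[[str], str],           # hypothetical hook
) -> Tuple[str, str]:
    """Translate the message to English, classify it with an English-only
    model, and return (intent, source_language) so that the reply can
    still be sent in the user's own language."""
    source_lang = detect_language(text)
    if source_lang == "en":
        english_text = text  # no translation needed
    else:
        english_text = translate_to_english(text, source_lang)
    return classify_english(english_text), source_lang
```

The design trade-off is that only one set of English intents needs training data, at the cost of an extra external call per message and a dependency on translation quality.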