Support for code mixed languages?


(Aashishgangwani) #1

Hey, I was wondering if we can have Rasa NLU to identify code mixed languages like Singlish? Reference: https://en.wikipedia.org/wiki/Singlish


(Neil Stoker) #2

Have you looked here? (“always check the docs first!!” :slight_smile: )

There’s some detail there that would be relevant.

It might be worth having a go with the tensorflow_embedding backend. Although it doesn’t mention code mixed input, the documentation does say this:

The tensorflow_embedding pipeline can be used for any language, because it trains custom word embeddings for your domain.

Since this approach doesn’t use pretrained language data you might be okay, although bear in mind that words you want it to learn would need to be well represented in your training data (it’s not magic!)

If you wanted to try the other backends, I think you’d find it much harder. Whilst the spaCy backend can work with different pre-trained word vectors, you’d need to source a Singlish one yourself (unless someone has made one already).

Finally, it’s only tangentially relevant, but I did spot this paper about creating code mixed datasets: [1806.05997v1] A Dataset for Building Code-Mixed Goal Oriented Conversation Systems


(Asim Zaman) #3

@nmstoker It’s not good to place my question here, but I need an urgent response: how can I define intents for a QnA chatbot for an LMS, where a student can ask questions on a subject such as Computer Science?


(Aashishgangwani) #4

Thanks!


(Aashishgangwani) #5

How about one intent per question? I am not sure but that could be a way.


(Neil Stoker) #6

@asimzaman you could try an intent per question as @aashishgangwan1 suggests. It’ll be fine for a narrow set of known questions, but it gets awkward if you want it to scale (you’ll need to add a new intent for each question).

It sounds like you’re under time pressure so this may be a little ambitious, but tools like DrQA are looking into generalisable approaches that search a knowledge base. A term to Google on this field is KBQA (Knowledge Base QnA). The DrQA repo is here: https://github.com/facebookresearch/DrQA

The sort of thing you could possibly do in a more limited timeframe using Rasa might be to create intents for your broad categories of questions, plus one for off-topic questions. Then for the on-topic ones, use the user’s keywords to search a relevant knowledge base (e.g. stick your questions and answers into something like ElasticSearch or even a SQLite DB).
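
To make that idea a bit more concrete, here is a rough sketch of the keyword lookup step using Python’s built-in sqlite3 module. It assumes the broad category has already been worked out elsewhere (for example from the intent Rasa returns), so no Rasa calls appear here. The table layout, the sample questions, the stop-word list and the function names are all made up for illustration, and the scoring is just a count of shared words rather than anything clever.

```python
import re
import sqlite3

# Hypothetical stop words; a real list would be longer.
STOP_WORDS = {"what", "is", "a", "an", "the", "how", "do", "does", "i", "of", "in", "to"}

# Hypothetical sample knowledge base: (category, question, answer).
FAQ = [
    ("data_structures", "What is a binary search tree?",
     "A tree where smaller keys go left and larger keys go right."),
    ("data_structures", "What is a hash table?",
     "A structure that maps keys to values using a hash function."),
    ("networking", "What does TCP stand for?",
     "Transmission Control Protocol."),
]


def build_kb(rows):
    """Load the question/answer rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE faq (category TEXT, question TEXT, answer TEXT)")
    conn.executemany("INSERT INTO faq VALUES (?, ?, ?)", rows)
    return conn


def keywords(text):
    """Lower-case the text, split it into words and drop the stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def best_answer(conn, category, user_text):
    """Return the answer whose stored question shares the most keywords
    with the user's text, or None if nothing in the category overlaps."""
    wanted = set(keywords(user_text))
    best, best_score = None, 0
    rows = conn.execute(
        "SELECT question, answer FROM faq WHERE category = ?", (category,)
    )
    for question, answer in rows:
        score = len(wanted & set(keywords(question)))
        if score > best_score:
            best, best_score = answer, score
    return best


conn = build_kb(FAQ)
# "data_structures" stands in for whatever category the intent classifier gave.
print(best_answer(conn, "data_structures", "how does a binary search tree work"))
```

A None result is the cue to fall back to a “sorry, I don’t know that one” reply. For anything beyond a small set of questions, a proper full-text index (ElasticSearch, or SQLite’s own full-text search) would likely do better than this word-overlap count.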

As a side note, stressing the urgency doesn’t tend to make people more likely to respond in forums (unless they know you or it’s life-threatening! :slightly_smiling_face:)

Best of luck with working on a solution - I’m sure others would like to know what you manage to create, so once you’ve made progress, why not post it on the Projects section?


(Neil Stoker) #7

Also there’s a video for a project I did a while back that may be slightly relevant for both of you (in different ways) https://youtu.be/xSN5fY5uYYg

It handles language detection (but sadly not code mixed language!) and then categorises a variety of questions into brief topics (in this case academic subjects, but for @asimzaman it could be sub-groups of questions). There’s no “answering” backend, so it’s not helping you there @asimzaman, but it might give some insight with intent examples.

The code is on GitHub too