Improve Rasa NLU model

rasa-nlu

(Patrick Da Silva) #1

Hi everyone,

in our company, we are trying to build a chatbot product that will have multiple functionalities, which means we need to be able to handle multiple intents, ideally somewhere in the range of 50-100; we are well equipped to generate examples for that many intents. This sounds like a problem for the standard linear SVM provided by Rasa, which already starts struggling at 15 intents, even though we have 100-150 good examples per intent.

What would be a good place to start to improve the quality of the classifications? We are already using entities, synonyms, lookup tables, etc. This is our current pipeline, which is pretty standard :

language: "de"
pipeline:
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "intent_featurizer_spacy"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_classifier_sklearn"

(Romain Huber) #2

Did you try the tensorflow pipeline ?

We are using this pipeline if you want to try :

language: "fr_core_news_md"

pipeline:
- name: "spell_check_component.SpellCheck"
- name: "nlp_spacy"
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

which combines nlp_spacy with the TensorFlow embedding classifier.

Plus, I checked the spaCy website, and maybe you can use the medium package, which is de_core_news_sm (German · spaCy Models Documentation).

Hope it helps; I’m not an expert on pipelines. If it doesn’t, you could try evaluating your pipeline on a test set with rasa.train evaluate (Evaluating and Improving Models) to see where the confusion is.
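To illustrate what "seeing where the confusion is" could look like, here is a minimal sketch in plain Python. It does not call any Rasa API: it assumes you have already collected the true intent and the predicted intent for each test example, and the labels below are made up for illustration.

```python
from collections import Counter


def confusion_pairs(true_intents, predicted_intents):
    """Count (true, predicted) pairs for misclassified examples, most frequent first."""
    if len(true_intents) != len(predicted_intents):
        raise ValueError("label lists must have the same length")
    pairs = Counter()
    for true, pred in zip(true_intents, predicted_intents):
        if true != pred:
            pairs[(true, pred)] += 1
    return pairs.most_common()


# Hypothetical test-set labels and model predictions (not real results).
true_intents = ["affirm", "affirm", "working", "working", "greeting", "deny"]
predicted_intents = ["affirm", "working", "affirm", "affirm", "greeting", "deny"]

for (true, pred), count in confusion_pairs(true_intents, predicted_intents):
    print(f"{true} -> {pred}: {count}")
```

The pairs at the top of the list are the intents whose training examples probably overlap the most, which is usually a good place to start looking.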


(Patrick Da Silva) #3

I tried it, and so far it achieved poorer performance than the sklearn one. I think it’s because the language used is not that specific to the use-case, we’re just trying to make our chatbot talk about a lot of stuff and that means lots of intents, so maybe I just need an architecture that can handle more intents.

But that’s actually what the gist of my question was about: for the sake of a complete example, let’s say I want to build an FAQ bot that can answer 20 questions. Let’s assume for simplicity that it’s a pure question-answer format, i.e. no form actions or follow-up intents are needed to answer a question. So all I need are some basic intents (affirm, deny, out_of_scope, greeting, goodbye) plus one intent per question, and the bot has to classify between those intents. But there are not many ways to say “yes” (maybe 100-300 if you add punctuation and that kind of stuff with a bit of creativity), whereas there are definitely thousands, even millions, of ways to ask a question, especially if you start using data generation tools and expect entities to show up in the question.

In this case, I don’t think that keeping a roughly equal number of examples per intent makes sense: even if I use every example I have for my basic intents, I am stuck somewhere below 300 examples per question intent, although I have much more data for those (and more variability, so more examples would be better). I was thinking that some hierarchical classification would be useful (i.e. a broadly defined “ask_question” intent, used only in the back end to trigger a second classifier that classifies the question), but I also see several problems with that approach and I don’t know if they can all be solved. Before diving into code, I was hoping that someone with a bit of experience down that road could share some thoughts! It would be really nice to discuss.
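To make the hierarchical idea concrete, here is a rough sketch of the control flow only. The two "classifiers" are toy keyword functions standing in for trained models, and all intent names other than the ones mentioned above (faq_pricing, faq_returns, faq_other) are hypothetical.

```python
def classify_hierarchically(text, top_level, question_classifier,
                            gateway_intent="ask_question"):
    """Run the coarse classifier first; refine only messages that hit the gateway intent."""
    intent = top_level(text)
    if intent == gateway_intent:
        return question_classifier(text)
    return intent


# Toy stand-in for the first-stage model (basic intents plus "ask_question").
def toy_top_level(text):
    lowered = text.lower().strip()
    if lowered in {"yes", "yeah", "sure"}:
        return "affirm"
    if lowered in {"no", "nope"}:
        return "deny"
    if lowered.endswith("?"):
        return "ask_question"
    return "out_of_scope"


# Toy stand-in for the second-stage model (one intent per FAQ question).
def toy_question_classifier(text):
    lowered = text.lower()
    if "price" in lowered or "cost" in lowered:
        return "faq_pricing"
    if "return" in lowered:
        return "faq_returns"
    return "faq_other"


print(classify_hierarchically("How much does it cost?",
                              toy_top_level, toy_question_classifier))
```

With this split, the first stage can stay small and balanced, while the second stage can be trained on as many question examples as are available. One obvious weakness is that a mistake in the first stage cannot be recovered by the second.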

P.S. : In German, there’s only one language model, it’s the small one (the one that you linked, de_core_news_sm). I’m happy to see that there’s a medium-sized language model in French and Spanish though, there hasn’t always been, this must be recent!


(Patrick Da Silva) #4

Update: I tried again, with more variability in the training examples and 2 more intents. I definitely have a scalability problem: my NLU (whether I use the proposed spaCy pipeline or the TensorFlow embedding pipeline) starts classifying things quite wrongly past a certain point.

Two things are keeping me stuck right now :

  • With this workflow, let’s say that I am again building an FAQ bot with lots of questions (> 10), and at some point in the dialog flow for one question I expect the user to say something specific that I want to catch with a new intent. Now I have to create a new NLU model for the whole bot, even though I’m only making a tiny modification. This doesn’t sound like a very scalable approach.

  • Again in the case of the FAQ bot, let’s say that I am writing a form action for an FAQ question and I need to know whether something is broken or not. So I want to write an intent called “working” (and another one called “not working”) where the user will tell me stuff like

  • it’s working fine
  • it’s not broken
  • yes, it’s working great!
  • no, it’s working
  • the device is working fine

and perhaps also put two buttons to catch the intent easier, but the user should always be able to type. But this intent is very close to my other intent, affirm (used as an answer to “was this helpful?” for example), which contains examples like

  • yes, you’re doing a good job!
  • yes, great job!
  • yes

and I want to catch the answer to that question with the intent “affirm” or “deny”. If I have to think about which intents are close to each other every time I need a new intent to catch an answer to a question, I will go nuts as the number of FAQ questions I want to answer goes up; every time I add one or two intents specific to one question, I’d have to worry about the whole bot. But the good thing is that when I know that the user is asking me a certain question, I have the context in the tracker, so there should be an easy way for the AI programmer to use it to build an NLU that classifies correctly, without having to ask the non-AI programmer who writes the question to rebuild datasets and produce new AI models.
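One way to picture "using the context" without retraining anything is to filter the classifier's ranked output by the intents that make sense after the last bot question. This is only a sketch: it assumes the classifier returns a list of intent names with confidences (similar in shape to an intent ranking), and the context names and confidence values below are invented.

```python
def pick_intent(intent_ranking, allowed_intents, fallback="out_of_scope"):
    """Return the highest-confidence intent that is allowed in the current context."""
    candidates = [r for r in intent_ranking if r["name"] in allowed_intents]
    if not candidates:
        return fallback
    return max(candidates, key=lambda r: r["confidence"])["name"]


# Hypothetical mapping: last bot question -> intents that can answer it.
EXPECTED_ANSWERS = {
    "ask_was_helpful": {"affirm", "deny"},
    "ask_device_status": {"working", "not_working"},
}

# Made-up ranking for "yes, it's working great!" from a single global model.
ranking = [
    {"name": "affirm", "confidence": 0.48},
    {"name": "working", "confidence": 0.41},
    {"name": "deny", "confidence": 0.06},
    {"name": "not_working", "confidence": 0.05},
]

print(pick_intent(ranking, EXPECTED_ANSWERS["ask_device_status"]))
print(pick_intent(ranking, EXPECTED_ANSWERS["ask_was_helpful"]))
```

The same message resolves to a different intent depending on the question that was just asked, so near-duplicate intents like “affirm” and “working” no longer compete with each other. It does not remove the need to retrain when a brand-new intent is added.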

So I was wondering if anyone figured out a way to scale NLU in that sense so that a programmer does not have to worry about those things and that the tasks of classification and designing questions can be separated. Maybe this is an open problem for now, but that’s why I am asking the community : I don’t know!

My current idea is to make multiple NLU interpreters and use Core to decide which one to use, but this will take me some programming and AI programming time. If anyone went in that direction, I would love to know how it turned out!
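For discussion, here is roughly what I mean by routing between several interpreters, as a plain-Python sketch. Nothing here is a Rasa class: KeywordInterpreter is a hypothetical stand-in for a trained model, the only assumption being that each interpreter exposes a parse(text) method, and the context key "device_status" is invented.

```python
class KeywordInterpreter:
    """Hypothetical stand-in for a trained NLU model exposing parse(text)."""

    def __init__(self, keyword_to_intent, default):
        self.keyword_to_intent = keyword_to_intent
        self.default = default

    def parse(self, text):
        lowered = text.lower()
        # Keywords are checked in insertion order, so put longer phrases first.
        for keyword, intent in self.keyword_to_intent.items():
            if keyword in lowered:
                return {"text": text, "intent": {"name": intent}}
        return {"text": text, "intent": {"name": self.default}}


class InterpreterRouter:
    """Pick an interpreter based on the dialogue context, else use the default one."""

    def __init__(self, default_interpreter, by_context=None):
        self.default_interpreter = default_interpreter
        self.by_context = dict(by_context or {})

    def parse(self, text, context=None):
        interpreter = self.by_context.get(context, self.default_interpreter)
        return interpreter.parse(text)


general = KeywordInterpreter({"yes": "affirm", "no": "deny"}, default="out_of_scope")
device = KeywordInterpreter({"not working": "not_working", "working": "working"},
                            default="out_of_scope")
router = InterpreterRouter(general, {"device_status": device})

print(router.parse("yes, it's working great!")["intent"]["name"])
print(router.parse("yes, it's working great!", context="device_status")["intent"]["name"])
```

The appeal is that a question-specific model can be added or retrained without touching the general one; the open part is how the dialogue side decides which context is active.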