Here is the general scenario I am looking at; I don't know how to do it, so I am looking for guidance. We currently have a list of about 35 intents and a bunch of training utterances for each, and I have trained a model.
Now what I am wondering is: what if a user enters an utterance that is completely different from the material I trained on? I don't want to force one of my 35 intents onto that utterance; I want something like "No Intent" to come back, indicating that none of the intents is a good match. I also don't want to create a new intent for "off-topic" utterances. When I run evaluate.py to score performance, how can I indicate that no intent matches such an utterance, and then get credit for successfully saying that none of the intents is relevant to it?
It seems to me that you can do a couple of things (in tandem) to resolve your issue.
First: flesh out your NLU file. The more examples you have per intent, the more closely the model will learn which phrases match a particular intent. This'll help drive the confidence for "out of scope" phrases down.
Second: specify a cut-off threshold in your Core training configuration, i.e. a minimum confidence level for your model. For example, here's roughly what mine looks like (see the sketch below):
My FallbackPolicy returns a fallback action (utter_fallback: "I don't know what you're saying! Rephrase!") if the NLU confidence dips under 0.2.
This threshold is quite low because my bot is Dutch and the Dutch language model is… sketchy. You'd need to adjust the level to find a good cut-off point for your own data.
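Concretely, the setup is something like the sketch below, using the Rasa Core Python API. The utter_fallback template has to exist in your domain, the file paths are just examples, and the thresholds are simply the values I happen to use:

```python
from rasa_core.agent import Agent
from rasa_core.policies.keras_policy import KerasPolicy
from rasa_core.policies.fallback import FallbackPolicy

# Trigger utter_fallback ("I don't know what you're saying! Rephrase!")
# whenever the NLU or Core confidence drops below the thresholds.
fallback = FallbackPolicy(
    fallback_action_name="utter_fallback",
    nlu_threshold=0.2,   # cut-off for the NLU intent confidence
    core_threshold=0.2,  # cut-off for Core's action prediction confidence
)

agent = Agent("domain.yml", policies=[KerasPolicy(), fallback])
training_data = agent.load_data("data/stories.md")  # example path
agent.train(training_data)
agent.persist("models/dialogue")
```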
Remy, thanks for the comments. They both make sense to me.
First, that is one of the things I have identified as needing to do: increase my training utterances. I will definitely do that. Second, when working with Core, I see how that would work with the confidences.
But for now I am working on NLU in isolation, and I am not responsible for coming up with the conversational model part. Is there a way to do this when focusing on NLU alone? Or is NLU always going to return one of the intents?
The NLU will always try to match an intent to your utterance. Look at it like this: you are creating a language for your bot. The domain of that language, and everything of that language your chatbot knows, is in your NLU data. Therefore, if you say something that is not in that domain, the bot thinks you're speaking its language, just badly. It has to try to figure out what you mean, in its own language.
A more practical example: We’re understanding each other in English right now. So when I say:
“The boat has to be filled with fish so it is able to come ashore”
you think you understand what I mean, but it is nonsensical in this context. Therefore, you try to link it with our conversation and you label that phrase as ‘explains_NLU’. However, you’re not entirely confident that I am actually explaining NLU to you with that phrase.
The chatbot does the same.
So now I say to you: you may only respond to my phrase if you're 50% certain or more. (Thus, I set a confidence threshold of 0.5 for you.) Would you say something in response to the phrase I just sent you? If not, you labeled it as "out of scope". If yes, you label it with your top 'meaning' (intent) and you respond.
And now I realise I never got around to 'fixing' your problem:
The NLU just classifies, and it works with the data you provide it. It doesn't decide whether something is out of scope; YOU do, as a developer. You'll always need a separate engine that decides what to do with the data the NLU feeds it.
Thanks for confirming that NLU always returns an intent. I see that in testing, if I set "intent": "", the evaluation will just skip that example and not evaluate it, which may be my best option for now.
What you say makes sense; however, my point is that this creates an implicit dependency of Rasa NLU on Rasa Core. Sure, I could swap in something else for Rasa Core and test them together, but I should be able to test for correct behavior at the NLU level, and that is where I am left unsatisfied. Perhaps there is a way to modify or customize the evaluation so that it looks at confidence levels and not just intent matches, roughly along the lines of the sketch below.
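For what it's worth, something like this is what I have in mind. It is just a sketch, assuming the rasa_nlu Python API; the model path, the "no_intent" label, the threshold, and the intent names in the toy test set are all made up:

```python
from rasa_nlu.model import Interpreter

REJECT_THRESHOLD = 0.5  # assumed cut-off, would need tuning

def predict_with_rejection(interpreter, text, threshold=REJECT_THRESHOLD):
    """Return the predicted intent name, or "no_intent" when confidence is too low."""
    result = interpreter.parse(text)
    intent = result.get("intent") or {}
    if intent.get("confidence", 0.0) < threshold:
        return "no_intent"
    return intent.get("name")

interpreter = Interpreter.load("./models/current/nlu")  # example model path

# Gold labels use "no_intent" for out-of-scope utterances (intent names are made up).
test_set = [
    ("how do I reset my password", "reset_password"),
    ("the boat has to be filled with fish", "no_intent"),
]
correct = sum(predict_with_rejection(interpreter, text) == gold for text, gold in test_set)
print("accuracy with rejection: {:.2f}".format(correct / len(test_set)))
```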
Haha yeah! You got it! I’m not an expert at any of this. I think you might be able to customize the NLU pipeline to do what you want but I’m not much of a programmer so I wouldn’t exactly know how to help you with that!
There's another trick you can try: have an "out_of_scope" intent that you fill with random chitchat unlike anything in your 35 specified intents. Then, especially if you have a lot of examples for your "real" intents, anything that is out of scope will match the "out_of_scope" intent.
Good one. Just a note: You’ll have to think carefully about adding this. Here’s why:
If you ever want to add an intent, you'll have to check the out_of_scope examples again. If there's an example that matches the 'new' intent, you'll have to move it to avoid NLU confusion.
You'll have to make absolutely sure that the examples in the out_of_scope intent overlap with the other intents as little as possible, to avoid confusion.
I would only use an "out_of_scope" intent to rule out the obvious stuff, like "I want pizza" (if you're not building a pizza delivery app, of course) or "I hate you", that kind of thing. It also makes your bot better if it answers differently when the user says something like this than when it fails to understand the user on relevant topics.
If you are not happy with your NLU always returning an intent: I'm assuming your application is written by you. If you are writing a chatbot, the Rasa configuration already lets you put a threshold on the confidence an intent classification has to reach before it is considered valid. Otherwise, if you are using NLU for other purposes, the NLU returns a JSON object, so you can always write a quick function that extracts the intent classification confidence from that JSON object, triggers whatever action you want to use NLU for when the confidence is above the threshold, and performs a fallback action otherwise (see the sketch below).
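For example, such a quick function might look like this sketch; the handler mapping, the fallback callable, and the threshold are placeholders for whatever your application actually does:

```python
FALLBACK_THRESHOLD = 0.5  # assumed minimum confidence for a "valid" classification

def dispatch(nlu_result, handlers, fallback):
    """nlu_result is the parsed JSON dict Rasa NLU returns for one utterance."""
    intent = nlu_result.get("intent") or {}
    name = intent.get("name")
    confidence = intent.get("confidence", 0.0)
    if name in handlers and confidence >= FALLBACK_THRESHOLD:
        return handlers[name](nlu_result)  # confident enough: run the matching action
    return fallback(nlu_result)            # otherwise treat it as "no intent"
```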
These are better to incorporate as small-talk intents that have custom answers. It's always better to make the bot respond accurately than to say "I dunno" all the time.
@Remy I am aware of this as a best practice in theory, but in practice I am still struggling to create a bot with 20+ intents and decent NLU predictions. Using a data generation tool and 13 intents, I can get something reasonable, but already at 14 intents I start having problems. I tried the tensorflow pipeline, with worse results than with the spaCy pipeline. So currently I would not recommend that someone who wants to build a bot for a specific purpose start putting functionality into the out-of-scope part, because it just puts a heavy load on the NLU. Of course I am not an expert either, since I haven't achieved a chatbot with more than 20 intents that actually looks good and is easy to manage, so this is just my personal suggestion, not a best practice. I already asked this question in another topic where I am discussing my ideas, so I won't elaborate here…
I catch your drift; our bots usually have no fewer than 80 intents (including the small-talk ones), so it might be less of a problem for us. You might just need more examples for the intents you are actually using.
80 intents?? That's insane. How many training examples do you have for each? I don't even understand how to get a thousand examples of an intent called "/affirm" and still keep a good bot… The problem doesn't lie in generating the examples themselves; sometimes they just don't exist.
We don't have 1000 examples per intent. We've generated about 50 examples per intent, and our users will generate the rest. It works, kind of. Just implement the TwoStageFallback to have users classify intents for your bot (see the sketch below).
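In case it helps, the hookup is roughly like this. It is only a sketch: the import path and constructor arguments of TwoStageFallbackPolicy differ between Rasa Core versions, and the threshold is just an example, so check the docs for your version:

```python
from rasa_core.agent import Agent
from rasa_core.policies.keras_policy import KerasPolicy
from rasa_core.policies.two_stage_fallback import TwoStageFallbackPolicy

# On a low-confidence NLU prediction the bot first asks the user to confirm
# the intent it suspects; only if the user denies it does it fall back for real.
two_stage = TwoStageFallbackPolicy(nlu_threshold=0.4)

agent = Agent("domain.yml", policies=[KerasPolicy(), two_stage])
```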
Core: it depends on the parameters I have at that moment, but our base model takes about 15 minutes to train the Keras policy and another 22 minutes to write the generated stories.
@Remy: Can I try your bot somewhere? I'm really curious what kind of quality you can achieve with so few examples. Do you have many users? And are you using the standard spaCy/TensorFlow pipelines, or did you tweak anything?
It's a Dutch bot we are making for clients, so sadly I cannot. The quality is not that great yet, though it kind of works. The thing is, we're still developing, and our users are testing more than actually using it. We're adding examples to the NLU as we go.