My NLU file is large. There are not too many intents, but some of them have thousands of examples. The problem is that intents with few examples are no longer detected. The intent “stop” contains the example “stop”, and the intent “deny” contains the example “no”.
However, typing “stop” or “no” gets an intent classification confidence score as low as 0.2 (and nlu_fallback gets activated).
How can I fix this?
I think a partial solution would be to map some examples like “stop” or “no” directly to an intent. Something like the KeywordIntentClassifier, but with the ability to specify which intents it should handle (instead of loading everything).
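For reference, the stock component can be added to the pipeline like this. This is only a sketch: as far as I know, the built-in KeywordIntentClassifier builds its keyword list from the training examples of all intents, so restricting it to selected intents would need a custom component.

```yaml
# config.yml (sketch): the built-in keyword classifier matches training
# examples as keywords inside the incoming message text.
pipeline:
  - name: KeywordIntentClassifier
    case_sensitive: false   # so "Stop" also matches the keyword "stop"
```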
The other alternative is to somehow up-sample the examples of the minority classes.
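As far as I know, DIET already counteracts some of this through balanced batching, which is the default; making it explicit would look roughly like this (a sketch, assuming a recent Rasa 2.x/3.x config):

```yaml
# config.yml (sketch): "balanced" batching over-samples minority intents
# within each training batch; it is already the default strategy.
pipeline:
  - name: DIETClassifier
    epochs: 100
    batch_strategy: balanced   # the alternative is "sequence" (no balancing)
```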
The first way to address class imbalance is always to collect more data for the underrepresented intents.
If an intent exists only to be triggered by specific words, you can instead use buttons with hard-coded intent payloads, e.g. payload: /deny. You can correct some imbalance with sampling techniques or hyperparameter tuning, but an imbalance that large indicates something is wrong with the structure of the data itself.
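As an illustration, a response with hard-coded payloads could look like this in the domain (a sketch; the response name, button titles, and text are invented):

```yaml
# domain.yml (sketch): button payloads trigger the intents directly,
# bypassing NLU classification entirely.
responses:
  utter_confirm_cancel:
    - text: "Do you want to stop here?"
      buttons:
        - title: "Yes, stop"
          payload: "/stop"
        - title: "No, keep going"
          payload: "/deny"
```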
How are you collecting examples for your intents? Do they come from real conversations, or are they synthetic?
@mloubser The source of the imbalance is the FAQ handling. If you have 3000 topics, with 3-6 questions each, what can you do?
How many alternative phrasings of yes/no/stop can you realistically gather? The FAQ questions are synthetic (I wrote them all), and they are handled by the response selector (with a success rate of around 80%).
I also have this problem, because one of my intents has multiple possible values for an entity, each of which is combined with the different ways of phrasing the intent, leading to a combinatorial explosion. See Confusion in Using Entity Synonyms - #2 by ivanmkc
Is there a hierarchy in your FAQ questions? A typical example is that “chitchat” is a separate set of responses from “FAQ”. In your case it might also be possible to split the FAQ questions into subgroups, as in the sketch below. Might that help?
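Something along these lines is what I mean (a sketch; the subgroup names faq_billing/faq_shipping and the example questions are invented, and the same pattern extends to further subgroups):

```yaml
# nlu.yml (sketch): each subgroup becomes its own retrieval intent, so the
# intent classifier only has to separate a handful of coarse groups.
nlu:
  - intent: faq_billing/how_to_pay
    examples: |
      - How do I pay my invoice?
      - Which payment methods do you accept?
  - intent: faq_shipping/delivery_time
    examples: |
      - When will my order arrive?
      - How long does shipping take?
```

```yaml
# config.yml (sketch): one ResponseSelector scoped to each subgroup.
pipeline:
  - name: ResponseSelector
    retrieval_intent: faq_billing
  - name: ResponseSelector
    retrieval_intent: faq_shipping
```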
In general, getting a score to properly represent certainty is a huge unsolved problem in ML. There’s an algorithm whiteboard video here that highlights recent work done by our research team on the topic. We recently introduced some hyperparameters for DIET that might help (see the sketch below). There’s also a PyData talk here that highlights how estimated probabilities are not a great proxy for certainty, which might help explain why it’s a hard problem to get right.
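For reference, I believe these are the hyperparameters in question; they can be set roughly like this (a sketch, assuming Rasa 2.3 or later; the same options exist for the ResponseSelector):

```yaml
# config.yml (sketch): constrain similarities and use linearly normalised
# confidences instead of softmax, which tends to produce confidence values
# that work better with fallback thresholds.
pipeline:
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
    model_confidence: linear_norm
```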