Because we cover a very specific domain, intents can lie extremely close together. Even when a very 'straightforward' question is asked, from which the intent should be easy to detect (and the entities are detected correctly), the NLU confidence threshold is not reached, and the chatbot asks the user to rephrase the question.
To create training stories we use Chatette, a module that generates stories with a similar structure, swapping some words for their synonyms. We then sample from the stories this module can generate, so that enough samples are available for chatbot training without inflating processing time too much.
Is there a way to cope optimally with intents covering closely related but distinct topics (for example, by assigning more weight to the core words that represent an intent when they are detected in a story)?
This is a good question! You can use [regex features](Training Data Format) to achieve what you want to some extent.
Probably the better approach is to do a hyperparameter search over the parameters of your pipeline. Say, for example, you're using the tensorflow embedding pipeline. Split it up into its components:
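For reference, the tensorflow embedding pipeline breaks down into roughly these components in `config.yml` (component names as used in older Rasa NLU releases; check against your version):

```yaml
pipeline:
- name: "tokenizer_whitespace"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
```

Each component then exposes its own hyperparameters (e.g. the featurizer's n-gram range, the classifier's number of epochs and embedding dimension) that can be tuned independently.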
And use a library like hyperopt to optimize those parameters with respect to your loss function. It might make sense to penalize confusion between closely related intents more heavily than other errors to achieve what you want.
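A minimal sketch of that weighted objective. To keep the example self-contained it uses a plain grid search over made-up evaluation results; in practice you would hand the `loss` function to hyperopt's `fmin` with an `hp.choice` search space instead. All names and counts here are illustrative, not from a real run:

```python
from itertools import product

# Illustrative held-out evaluation results per hyperparameter setting:
# (epochs, embed_dim) -> (close-intent confusions, other errors).
RESULTS = {
    (100, 10): (8, 4),
    (100, 20): (5, 6),
    (300, 10): (3, 9),
    (300, 20): (2, 7),
}

def loss(epochs, embed_dim, close_penalty=3.0):
    """Weighted loss: a confusion between closely related intents
    counts `close_penalty` times as much as any other error."""
    close, other = RESULTS[(epochs, embed_dim)]
    return close_penalty * close + other

# Plain grid search over the candidate settings; hyperopt's fmin
# would explore the same space more cleverly.
best = min(product([100, 300], [10, 20]), key=lambda p: loss(*p))
print(best)  # → (300, 20): the setting with the fewest weighted errors
```

The key design choice is the `close_penalty` factor: raising it steers the search toward settings that separate the near-identical intents, even at the cost of a few extra errors elsewhere.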
Another possibility is to use [multi intents](Choosing a Rasa NLU Pipeline) to construct multi-intents like main_topic+subtopic_1, main_topic+subtopic_2, etc.
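As a sketch, in the (older) Markdown training-data format such multi-intent examples could look like the following, with the intent names above used as placeholders; the classifier then needs intent tokenization enabled (`intent_tokenization_flag: true` with `intent_split_symbol: "+"`):

```md
## intent:main_topic+subtopic_1
- tell me about subtopic one of the main topic

## intent:main_topic+subtopic_2
- how does subtopic two relate to the main topic?
```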
Many thanks for the answer! These are definitely some interesting suggestions to try out. I'll keep you posted about the most effective technique for solving the issue I am facing!