I am working on a chatbot project that mainly uses Rasa NLU.
The chatbot is for FAQ support.
Because some of the FAQs are worded quite similarly, NLU returns a lower confidence level once the number of intents (types of questions) grows.
E.g., initially I tested with 2 intents and the confidence level was around 70 to 90%. Now the project has 80 intents and the confidence level is around 8 to 10%.
In this kind of scenario, with around 80 intents, some of which are quite similar in nature, what should the confidence threshold for the NLU be?
Do you have a test set? If not, use 3-fold cross-validation.
To check for overfitting and underfitting, compare the F1 score on the test set with the F1 score on the training set.
If test is low and train is high, you are overfitting; if both are low, you are underfitting.
I think the F1 score is a good baseline indicator of how good your model is.
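To make the train vs. test comparison concrete, here is a rough sketch in plain Python. It assumes you have already collected the true and predicted intent for each example (for instance by running your trained model over each fold); the function names `macro_f1` and `k_fold_indices` and the toy intent labels are made up for illustration and are not part of Rasa.

```python
from collections import defaultdict


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-intent F1 scores."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for true, pred in zip(true_labels, predicted_labels):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted this intent, but it was wrong
            fn[true] += 1  # missed the correct intent
    intents = set(true_labels) | set(predicted_labels)
    scores = []
    for intent in intents:
        denom = 2 * tp[intent] + fp[intent] + fn[intent]
        scores.append(2 * tp[intent] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def k_fold_indices(n_examples, k=3):
    """Yield (train_idx, test_idx) pairs; shuffle your data beforehand."""
    for fold in range(k):
        test_idx = [i for i in range(n_examples) if i % k == fold]
        train_idx = [i for i in range(n_examples) if i % k != fold]
        yield train_idx, test_idx


# Toy predictions (hypothetical): perfect on train, mixed on test.
train_true = ["billing", "billing", "refund", "refund"]
train_pred = ["billing", "billing", "refund", "refund"]
test_true = ["billing", "refund", "refund", "billing"]
test_pred = ["billing", "billing", "refund", "refund"]

train_f1 = macro_f1(train_true, train_pred)
test_f1 = macro_f1(test_true, test_pred)
print(f"train F1 = {train_f1:.2f}, test F1 = {test_f1:.2f}")
```

A large gap like the one in this toy data (high on train, much lower on test) is the overfitting pattern. Note that the split above is not stratified, so with 80 intents you would probably want each fold to contain examples of every intent.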
Also, is your dataset balanced? We have over 140 intents, and most of them are not well balanced; we are now trying to merge some classes and figure out programmatically how to deal with the differences. Since you are doing FAQ, one good reminder from tests I have seen with real users: people don't interact with FAQ pages the same way they do with a chatbot. So if you have reused the same classes for your chatbot, it won't really make sense in the end.
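A quick way to check the balance is to count the training examples per intent. This is only a sketch: the function name `intent_balance`, the `min_examples` cutoff of 10, and the toy labels are assumptions for illustration, and you would feed it the intent labels from your own training data.

```python
from collections import Counter


def intent_balance(labels, min_examples=10):
    """Summarize how evenly the training examples are spread over intents."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    # Intents with fewer examples than the (arbitrary) cutoff.
    sparse = sorted(i for i, c in counts.items() if c < min_examples)
    return {
        "counts": dict(counts),
        "imbalance_ratio": largest / smallest,
        "sparse_intents": sparse,
    }


# Toy data (hypothetical): one well-covered intent, one sparse intent.
labels = ["billing"] * 12 + ["refund"] * 3
report = intent_balance(labels)
print(report["imbalance_ratio"], report["sparse_intents"])
```

Intents that show up as sparse are the ones to look at first, either to collect more examples for them or to consider merging them with a similar intent.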