I’m working on a bot for internal communication within a company. I have about 20 intents to distinguish, and the bot works quite well for most of them (F1 around 0.900). For each intent I have prepared upwards of 40 example phrases, and for the ones where I expected difficulty I’ve included more.

Some intents are quite similar, and on these the bot makes mistakes. The problem is that these mistakes seem trivial: among the misclassified phrases are some that contain what are essentially keywords for the desired intent. That is, a phrase contains a word X which occurs verbatim in the training set, and only in the examples for the desired intent, and yet the phrase is classified as a different intent. As I’ve said, the bot works quite well on phrases I would consider harder, so it’s hard for me to make sense of this. Is there any recognizable factor that could be behind it?
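To make the effect concrete, here is a toy sketch, not my actual bot (whose internals I don’t know): the intents, phrases, and the hand-rolled multinomial Naive Bayes are all invented for illustration. It shows how a word that occurs only in one intent’s training examples can still lose to a better-represented intent whose common words and class prior dominate the score.

```python
import math
from collections import Counter

# Hypothetical training data: intent names and phrases are made up.
train = {
    "vacation_request": [          # the "desired" intent, few examples
        "book vacation days",
        "vacation leave form",
    ],
    "it_request": [                # a similar, better-represented intent
        "how do i request a laptop",
        "how do i request access",
        "how do i request a parking spot",
        "how do i request a new monitor",
        "how do i request software",
    ],
}

# Multinomial Naive Bayes with Laplace smoothing, done by hand.
vocab = {w for phrases in train.values() for p in phrases for w in p.split()}
n_docs = sum(len(phrases) for phrases in train.values())

def score(intent, phrase):
    tokens = [t for p in train[intent] for t in p.split()]
    counts = Counter(tokens)
    total = len(tokens)
    s = math.log(len(train[intent]) / n_docs)  # class prior
    for w in phrase.split():
        # Laplace-smoothed word likelihood
        s += math.log((counts[w] + 1) / (total + len(vocab)))
    return s

# "vacation" occurs verbatim in training, and only under vacation_request...
query = "how do i request vacation"
pred = max(train, key=lambda c: score(c, query))
print(pred)  # prints "it_request"
```

Here the one distinctive word is outweighed by the four overlapping words and the larger prior of the other intent, so the query is misclassified despite containing a keyword unique to the desired intent. Is something analogous plausibly happening in my case?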