Unknown word extracted as entity

I have trained a bot to recognize an intent show_me_X, with examples like:

hello, show me [dresses](product), please
can you show me [skirts](product)?

Now, when I feed the bot a sentence similar to the examples above, but where the product is a completely unknown word for the bot (e.g., show me cars or show me drones), the result is not always good:

  • In some cases the bot predicts the nlu_fallback intent and still extracts the unknown word as a product entity.
  • In other cases the bot predicts the show_me_X intent and extracts the unknown word as a product entity.

The second case seems to happen when the unknown word shares a root with one of the known products (e.g., drones and dresses, which share character n-grams like "dr" and "es"). I wonder if this is due to the char-based CountVectorsFeaturizer, although the docs say it is only used for intent classification and response selection:

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 70
  use_masked_language_model: True
- name: FallbackClassifier
  threshold: 0.7  
- name: EntitySynonymMapper

If the problem is the featurizer, I could remove it, but I would prefer not to, since it helps the bot understand slight variations of words that are not entities. So, how could I prevent my bot from extracting a product entity when the user did not provide a known product? Would combining several entity extractors help?
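
For reference, this is the kind of combination I had in mind (just a sketch, not tested; the lookup entries are made up): let DIETClassifier keep doing intent classification but turn off its entity recognition, and add a RegexEntityExtractor that only tags products listed in a lookup table, so words the bot has never seen would not be extracted:

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 70
  use_masked_language_model: True
  entity_recognition: False    # intents only; entities come from the extractor below
- name: RegexEntityExtractor
  use_lookup_tables: True      # only tag values listed in the product lookup table
  use_regexes: False
- name: FallbackClassifier
  threshold: 0.7
- name: EntitySynonymMapper

with a lookup table for the product entity in the NLU data along these lines:

nlu:
- lookup: product
  examples: |
    - dresses
    - skirts

The trade-off, as far as I understand it, is that only exact matches against the lookup table get extracted, so misspellings or newly added products would need to be kept in the table.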

Hi @humcasma :wave: How many training examples do you have in your NLU data for your product entity?

Hi @m.vielkind! As part of my test I am using a training data generator. Since I have quite a few products, quite a few entities, and quite a few ways of expressing the intent, I am generating around 1000 training examples.