Customize OOV_token in CountVectorsFeaturizer?

Hi,

As per the Rasa documentation for OOV_token:

Since the training is performed on limited vocabulary data, it cannot be guaranteed that during prediction an algorithm will not encounter an unknown word (a word that was not seen during training). In order to teach an algorithm how to treat unknown words, some words in training data can be substituted by the generic word OOV_token. In this case, during prediction, all unknown words will be treated as this generic word OOV_token.
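To make the quoted behaviour concrete, here is a minimal pure-Python sketch of the substitution idea. It is not Rasa's actual implementation: the tokenizer, the helper names (`build_vocabulary`, `replace_unknown`) and the token value `"oov"` are assumptions made only for illustration.

```python
import re

OOV_TOKEN = "oov"  # assumed value for OOV_token; any otherwise unused word would do


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberate simplification)."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(training_examples):
    """Collect every token seen in the training examples."""
    return {token for example in training_examples for token in tokenize(example)}


def replace_unknown(text, vocabulary, oov_token=OOV_TOKEN):
    """Replace tokens that were not seen during training with the generic OOV token."""
    return [t if t in vocabulary else oov_token for t in tokenize(text)]


training_examples = [
    "Tell me about any nearby Indian restaurant",
    "Is there an oov nearby",  # a training example that already contains the OOV token
]
vocabulary = build_vocabulary(training_examples)

# "museum" was never seen in training, so it is mapped to the OOV token.
tokens = replace_unknown("Tell me about any nearby Indian museum", vocabulary)
print(tokens)  # ['tell', 'me', 'about', 'any', 'nearby', 'indian', 'oov']
```

The point of the sketch is only that, once the generic token is part of the training vocabulary, every unseen word collapses onto it at prediction time.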

I have a similar situation: the user asks an out-of-scope question (one unrelated to what the bot was built for), but the bot still classifies it as one of the trained intents.

Example: restaurant bot, negative scenario.

NLU is trained on: Tell me about any nearby Indian restaurant?
Answer: The nearest Indian restaurant is on Church Street, 2 km from your location.

User: Tell me more about any nearby Indian community centers? (This question is out of scope, but it uses similar words.)

Bot reply: The nearest Indian restaurant is on Church Street, 2 km from your location.

We tried adding this question to the outofscope intent, but there can be many such cases where a user asks an out-of-scope question that closely resembles an intent in the trained model.

After reading about OOV_token, it seems it might help to treat "community center" as an OOV word and add the question to the outofscope intent, so that when the user asks the same question it falls into outofscope. But the problem remains: how many keywords would I have to add?

Is there an option where I can provide a list of words, so that when Rasa NLU encounters an untrained word it looks it up in the list and, if the word is not there, either uses the two-stage fallback policy or does not pick a trained intent?

@piyush29programmer You can also add example sentences containing the oov token directly to your training data. Then you don’t need to list keywords.

So you could have e.g. “Tell me more about any nearby Indian oov oov?”
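As a rough illustration of why this helps, the sketch below masks unknown words in the user's message and compares the result with the training examples. It is a simplified stand-in, not what Rasa does internally: the intent name `find_restaurant`, the tokenizer and the `mask_unknown` helper are hypothetical, and `"oov"` is assumed to be the configured OOV_token value.

```python
import re

OOV = "oov"  # assumed OOV_token value, as used in the example sentence

# Hypothetical training data: (example text, intent name)
training_data = [
    ("Tell me about any nearby Indian restaurant?", "find_restaurant"),
    ("Tell me more about any nearby Indian oov oov?", "outofscope"),
]


def tokenize(text):
    """Lowercase and split on non-word characters (a simplification)."""
    return re.findall(r"\w+", text.lower())


# Vocabulary seen in training, which now includes the OOV token itself.
vocabulary = {tok for text, _ in training_data for tok in tokenize(text)}


def mask_unknown(text):
    """Replace words never seen in training with the OOV token."""
    return " ".join(t if t in vocabulary else OOV for t in tokenize(text))


user_message = "Tell me more about any nearby Indian community centers?"
masked = mask_unknown(user_message)

# Intents whose training examples are identical to the masked message.
matches = [intent for text, intent in training_data
           if " ".join(tokenize(text)) == masked]

print(masked)   # tell me more about any nearby indian oov oov
print(matches)  # ['outofscope']
```

In this toy setup "community" and "centers" are both unseen, so the message ends up looking exactly like the outofscope example rather than the restaurant one. A real classifier works on features and confidences rather than exact matches, so treat this only as intuition.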