How to log sentences containing OOV words and explicitly mark OOV words?

Hey,

how would I go about logging OOV words and the sentences they appear in? From what I can tell, NLU doesn’t explicitly mark them.

Br Dan

I guess you’re using the tensorflow pipeline?

you have to add this to your config file:

- name: "intent_featurizer_count_vectors"
  OOV_token: oov

And then add a few sentences to your training data that contain this OOV token.
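As I understand it, a few training examples with the OOV token in place of unknown words might look like this (intent name and sentences are just illustrative):

```
## intent:ask_card
- what's my oov card number
- what's my credit card oov
```

At prediction time, the featurizer then maps any word that is not in its vocabulary onto this token.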

By “OOV token”, do you mean I have to specify some unknown words under the entity name “oov”?

for example

“what’s my credit card number ?”

In the above input, say I want the NLU to tag “credit” as OOV. Do I have to set “credit” as an entity under the name “oov”?

Please clarify.

Maybe have a look at how I do it:

and tell me if you get the same behaviour, i.e. that using OOV tokens leads to unexpected results?

Sorry for not being clear enough.

From your answer I take it that I’d do something like:

## intent:mood_okay
- i am fine oov

Message: i am fine asckjnascjknsajcn
DEBUG:rasa_core.tracker_store:Recreating tracker for id 'xxx'
DEBUG:rasa_core.processor:Received user message 'i am fine asckjnascjknsajcn' with intent '{'name': 'mood_okay', 'confidence': 0.9735773801803589}' and entities '[]'

However, when I do that, how do I recognize the OOV word so that I can log it?

For example to get back:

DEBUG:rasa_core.processor:Received user message 'i am fine asckjnascjknsajcn' with intent '{'name': 'mood_okay', 'confidence': 0.9735773801803589}' and entities '[]', OOV_strings '["asckjnascjknsajcn"]'

Then I can save the whole sentence together with the OOV_strings, so I can add them to the training data later in case they make sense.

@Abir no, you don’t need to label any entity; just add sentences like “what’s my oov card number” or “what’s my credit card oov”.

@deW1 yes, that’s how you add them, but the words that are found to be OOV don’t get logged at the moment. You can probably implement that in a custom way.
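Until that’s supported out of the box, here is a minimal sketch of what such a custom implementation could look like. It is not part of the Rasa API: the `find_oov_words` helper and the `vocabulary` set are assumptions, and you would have to collect the vocabulary from your own training examples.

```python
import logging

logger = logging.getLogger(__name__)

def find_oov_words(message, vocabulary):
    """Return the tokens in `message` that do not appear in the training vocabulary."""
    tokens = message.lower().split()
    return [t for t in tokens if t not in vocabulary]

# Hypothetical vocabulary, collected from your NLU training examples.
vocabulary = {"i", "am", "fine", "what's", "my", "credit", "card", "number"}

message = "i am fine asckjnascjknsajcn"
oov_strings = find_oov_words(message, vocabulary)
if oov_strings:
    # Log the full sentence together with the OOV words, as asked above.
    logger.debug("Message %r contains OOV words: %r", message, oov_strings)
```

You could then persist the logged sentence/OOV pairs and review them periodically to decide which words are worth adding to the training data.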

thank you

@Abir Have you tried it? What is your experience (compared to my post above)?

What I felt is that using OOV can sometimes spread like wildfire. If the percentage of training data containing the OOV token is high, the NLU is more likely to tag every user input as out-of-vocabulary. OOV can be used to a certain extent, but I prefer creating new labels/intents to classify all garbage data. That worked out for me.
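The intent-based alternative described above could be sketched like this (intent name and gibberish examples are purely illustrative):

```
## intent:out_of_scope
- asdkjhaskjdh
- qwertyuiop hjkl
- blah blah blah
```

Inputs resembling these then get classified under the dedicated garbage intent instead of leaking into the real intents, and the bot can respond with a fallback message.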