What is the best way to handle variations within training data?

rasa-nlu

(Jack) #1

So for example

“How would I pin [application] to the taskbar”

where [application] could be Word, Excel, Outlook etc.

Would it be best to hard-code multiple variations, such as:

“How would I pin word to the taskbar”
“How would I pin excel to the taskbar”
“How would I pin outlook to the taskbar”

Or is there a better way?


(Akela Drissner) #2

Yes, there’s a tool called Chatito for this as well
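For readers who want to see the idea behind a generator tool, here is a minimal Python sketch that expands a few templates over a list of application names. It is not Chatito's own syntax; the templates, the `{app}` placeholder, and the `expand` helper are assumptions made up for illustration.

```python
from itertools import product

# Hypothetical templates and slot values, for illustration only.
TEMPLATES = [
    "How would I pin {app} to the taskbar",
    "How do I add {app} to the taskbar",
]
APPLICATIONS = ["Word", "Excel", "Outlook"]


def expand(templates, values):
    """Return every template filled in with every value."""
    return [t.format(app=v) for t, v in product(templates, values)]


examples = expand(TEMPLATES, APPLICATIONS)
for line in examples:
    print(line)
```

Varying the sentence frame as well as the application name keeps the generated examples from being identical apart from one word.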


(Jack) #3

Yes, that is the best way to do it, or yes, there is a better way to handle variations, such as with entities?


(Jack) #4

Bump sorry.


(Akela Drissner) #5

This is the way to do it


(Jack) #7

So, to clarify and finalise for future viewers: the best way to handle variants in training data is to include one near-duplicate example per variant.

how would I pin word to the taskbar
how would I pin excel to the taskbar
how would I pin powerpoint to the taskbar

Would this not cause issues with too many stop words?


(Akela Drissner) #8

No, it shouldn’t cause issues. You don’t have to provide every possible value, but some examples need to be there.
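As a rough sketch of "some examples, not every value", the snippet below samples a few application names and writes them out with inline entity annotations. It assumes the Markdown training data format that Rasa NLU used at the time (`## intent:...` headers and `[value](entity)` markup); the intent name `pin_to_taskbar`, the entity name `application`, and the helper functions are hypothetical.

```python
import random

# Hypothetical intent and entity names, chosen for illustration.
INTENT = "pin_to_taskbar"
ENTITY = "application"
TEMPLATE = "How would I pin {value} to the taskbar"
ALL_VALUES = ["word", "excel", "outlook", "powerpoint", "onenote", "teams"]


def annotated_examples(template, entity, values, k, seed=0):
    """Fill the template with k sampled values, marked up as [value](entity)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = rng.sample(values, k)
    return [template.format(value=f"[{v}]({entity})") for v in chosen]


def to_markdown(intent, examples):
    """Render the examples as one Markdown intent section."""
    return "\n".join([f"## intent:{intent}"] + [f"- {e}" for e in examples])


examples = annotated_examples(TEMPLATE, ENTITY, ALL_VALUES, k=3)
markdown = to_markdown(INTENT, examples)
print(markdown)
```

Only three of the six values end up in the training data here; the hope is that the entity extractor learns from the surrounding words and can pick up values it has not seen.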


(Datisto) #9

Is it important to balance the number of examples for each variation too? Does the TensorFlow embedding classifier also use the count of each word for classification? If one variation has more examples than the others within an intent, does this have an impact? To be exact, using the example above: I might have intent examples of one type, like "how would I pin powerpoint to the taskbar", more often than examples of another type (with other vocabulary words).


(Akela Drissner) #10

It shouldn’t have a huge impact, unless you have 100 examples of one thing, and only 2 of another


(Jack) #11

What do we do if we do have 100 examples of one thing and only 2 of the other?

This is what I’m getting at: should there not be a more exact way of adding variants without having to balance the number of example expressions across all intents?


(Akela Drissner) #12

Add more examples to the intent with only 2 examples. It would be impossible for an NLU model to accurately predict the intent with so few examples. You don’t have to balance them exactly, but there should be a reasonable number of examples for each intent.
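A quick way to spot the "100 versus 2" situation before training is to count examples per intent. The sketch below is one possible check, not something built into Rasa; the thresholds `min_examples` and `max_ratio`, the intent names, and the helper functions are assumptions picked for illustration.

```python
from collections import Counter


def intent_counts(examples):
    """Count examples per intent from (intent, text) pairs."""
    return Counter(intent for intent, _ in examples)


def underrepresented(counts, min_examples=10, max_ratio=10.0):
    """Return intents with too few examples, or far fewer than the largest intent."""
    largest = max(counts.values())
    return sorted(
        intent
        for intent, n in counts.items()
        if n < min_examples or largest / n > max_ratio
    )


# Hypothetical data set: 100 examples of one intent and 2 of another.
data = [("pin_to_taskbar", f"example {i}") for i in range(100)]
data += [("unpin_from_taskbar", f"example {i}") for i in range(2)]

counts = intent_counts(data)
print(counts)
print(underrepresented(counts))
```

Any intent the check flags is a candidate for more hand-written or generated examples; the exact cut-offs are a judgment call rather than a rule.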