Rasa NLU - Understanding Training Data

I am having a hard time understanding training data in Rasa NLU. Say I want training data where a user is informing the bot of an animal they want to buy. For clarity I’ll use the markdown format:

Say the user is hypothetically responding to a question:

"What kind of animal would you like to buy?"

There are only so many different ways of saying you want to buy something. So take the below example:

##intent:inform
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Would I need to repeat this for every type of animal I intended to handle? Like below?

##intent:inform
- [cat](animal)
- [dog](animal)
- [parrot](animal)
- buy [cat](animal)
- buy [dog](animal)
- buy [parrot](animal)
- I would like to buy a [cat](animal)
- I would like to buy a [dog](animal)
- I would like to buy a [parrot](animal)

Also, I noticed that in Rasa’s restaurant bot the same example is sometimes repeated over and over, up to seven times, like below:

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Why is that necessary? What effect does this have on the understanding? How would more occurrences of the same single word in the same position indicate that it is an appropriate response, especially if you had something like the below, where a different value of the same entity is repeated the same number of times?

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- buy [dog](animal)
- I would like to buy a [dog](animal)

I am curious about the above because I see it in the old format in franken_data.json. To get a better understanding, do a quick search for "text": "cheap". You should get 14 identical results.
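For anyone who would rather count the repeats than search by hand, here is a minimal sketch. It assumes the old JSON layout with a top-level `rasa_nlu_data` object containing a `common_examples` list. The helper name `count_duplicate_examples` and the inline sample are made up for illustration and are not taken from franken_data.json.

```python
import json
from collections import Counter


def count_duplicate_examples(data):
    """Return {(text, intent): count} for examples that occur more than once.

    Assumes the old Rasa NLU JSON layout:
    {"rasa_nlu_data": {"common_examples": [{"text": ..., "intent": ...}, ...]}}
    """
    examples = data["rasa_nlu_data"]["common_examples"]
    counts = Counter((ex["text"], ex.get("intent")) for ex in examples)
    # Keep only the pairs that appear more than once.
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up sample standing in for a real training file.
sample = json.loads("""
{
  "rasa_nlu_data": {
    "common_examples": [
      {"text": "cheap", "intent": "inform", "entities": []},
      {"text": "cheap", "intent": "inform", "entities": []},
      {"text": "cheap", "intent": "inform", "entities": []},
      {"text": "buy a cat", "intent": "inform", "entities": []}
    ]
  }
}
""")

duplicates = count_duplicate_examples(sample)
for (text, intent), n in duplicates.items():
    print(f"{n}x {text!r} ({intent})")
```

To run it against a real file, replace the inline sample with `json.load(open(path))`, where `path` points at your own training data.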

Thank you, any advice is appreciated.

No, but you do need a sufficient number of examples (i.e. test it out and see!). See also Chatito.

Not sure about the duplicates. My recollection is that this is data provided by someone else.

Hello, I have trained a model with several intents. The resulting training_data.json contains the correct intents and entities:

- intent examples: 10 (5 distinct intents)
- found intents: 'budget', 'customers', 'revenue', 'Debt/EBITDAX', 'production'
- entity examples: 10 (4 distinct entities)
- found entities: 'verb', 'value', 'time', 'speed'

After that I tested the same sentences that I had included in training, but I am getting wrong intents, and the entities come back as an empty list:

pprint(interpreter.parse("54% revenue growth in the last two quarters to $10m"))

Here I should get revenue as the intent, but it shows production:

{
  "intent": {
    "name": "production",
    "confidence": 0.5166328781001349
  },
  "entities": [],
  "intent_ranking": [
    { "name": "production", "confidence": 0.5166328781001349 },
    { "name": "revenue", "confidence": 0.1783357881550128 },
    { "name": "Debt/EBITDAX", "confidence": 0.14805304182717913 },
    { "name": "customers", "confidence": 0.08582269154352269 },
    { "name": "budget", "confidence": 0.0711556003741502 }
  ],
  "text": "54% revenue growth in the last two quarters to $10m"
}
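When checking many training sentences this way, it can help to reduce each parse result to a few fields instead of reading the full output. The following is only a sketch: `summarize_parse` is a hypothetical helper, not part of Rasa, and it assumes nothing beyond the dictionary shape shown in the output above.

```python
def summarize_parse(result, expected_intent):
    """Reduce a parse result dict to the fields needed for a quick sanity check."""
    ranking = sorted(
        result["intent_ranking"], key=lambda r: r["confidence"], reverse=True
    )
    top = ranking[0]
    # Gap between the best and second-best intent; None if only one is ranked.
    margin = top["confidence"] - ranking[1]["confidence"] if len(ranking) > 1 else None
    return {
        "top_intent": top["name"],
        "top_confidence": top["confidence"],
        "margin": margin,
        "matches_expected": top["name"] == expected_intent,
        "has_entities": bool(result["entities"]),
    }


# The parse output quoted in this post.
result = {
    "intent": {"name": "production", "confidence": 0.5166328781001349},
    "entities": [],
    "intent_ranking": [
        {"name": "production", "confidence": 0.5166328781001349},
        {"name": "revenue", "confidence": 0.1783357881550128},
        {"name": "Debt/EBITDAX", "confidence": 0.14805304182717913},
        {"name": "customers", "confidence": 0.08582269154352269},
        {"name": "budget", "confidence": 0.0711556003741502},
    ],
    "text": "54% revenue growth in the last two quarters to $10m",
}

summary = summarize_parse(result, expected_intent="revenue")
print(summary)
```

Running the same helper over every training sentence gives a quick list of the ones where `matches_expected` is false or no entities are found.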

regards, alpes

As @deepshet said, you don’t need to provide every single example, but you do need to provide a variety of examples. The linked tool called Chatito should help you with this.
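To illustrate the idea behind generating varied examples from a few patterns, here is a minimal Python sketch. It is not Chatito's actual syntax or API. The function `expand_templates` and the `{animal}` placeholder convention are assumptions made for this example, and the output uses the markdown format from the question.

```python
from itertools import product


def expand_templates(intent, templates, entity, values):
    """Combine sentence templates with entity values into markdown training lines."""
    lines = [f"##intent:{intent}"]
    for template, value in product(templates, values):
        # Annotate the value the way the markdown format expects: [value](entity)
        annotated = f"[{value}]({entity})"
        lines.append("- " + template.format(**{entity: annotated}))
    return lines


templates = ["{animal}", "buy {animal}", "I would like to buy a {animal}"]
values = ["cat", "dog", "parrot"]

lines = expand_templates("inform", templates, "animal", values)
print("\n".join(lines))
```

This produces the full set of combinations from the question. As noted above, the full cross product is not required, so a subset of the generated lines may be enough once you test it.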

As for duplicates, yeah this is the bAbI dataset converted into our format. The duplicates don’t change anything in accuracy.

Hello, could I see the converted bAbI dataset? I have the original (Facebook format) and a ParlAI-converted version, but I don’t see Rasa’s.