Rasa NLU - Understanding Training Data

adwade595 · August 25, 2018, 8:56pm

I am having a hard time understanding training data in Rasa NLU. Say I want training data where a user tells the bot which animal they would like to buy. For clarity I’ll use the markdown format:

Say the user is hypothetically responding to a question:

"What kind of animal would you like to buy?"

There are only so many different ways of saying you want to buy something. So take the below example:

##intent:inform
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Would I need to repeat this for every type of animal I intended to handle? Like below?

##intent:inform
- [cat](animal)
- [dog](animal)
- [parrot](animal)
- buy [cat](animal)
- buy [dog](animal)
- buy [parrot](animal)
- I would like to buy a [cat](animal)
- I would like to buy a [dog](animal)
- I would like to buy a [parrot](animal)

Also, I noticed that in rasa’s restaurant bot, they sometimes repeat the same example over and over again, sometimes up to seven times, like below:

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Why is that necessary? What effect does this have on the understanding? How would more occurrences of the same single word in the same position indicate that it is an appropriate response, especially if you had something like the below, where a different value of the same entity is repeated the same number of times?

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- buy [dog](animal)
- I would like to buy a [dog](animal)

I am curious about the above because I see it in the old format in franken_data.json. To see what I mean, do a quick search for “text”: “cheap”. You should get 14 identical results.

github.com

RasaHQ/rasa_core/blob/master/examples/restaurantbot/data/franken_data.json

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "moderately priced restaurant that serves creative food", 
        "intent": "inform", 
        "entities": [
          {
            "start": 41, 
            "end": 49, 
            "value": "creative", 
            "entity": "cuisine"
          }, 
          {
            "start": 0, 
            "end": 10, 
            "value": "moderate", 
            "entity": "price"
          }
        ]

This file has been truncated.
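The duplicate search described above can also be done programmatically. The following is a minimal sketch that assumes only the JSON layout shown in the excerpt (`rasa_nlu_data` → `common_examples` → `text`); the helper name `count_duplicate_texts` and the inline sample data are made up for illustration and are not part of Rasa.

```python
from collections import Counter


def count_duplicate_texts(nlu_data):
    """Return a Counter of example texts that occur more than once."""
    examples = nlu_data["rasa_nlu_data"]["common_examples"]
    counts = Counter(example["text"] for example in examples)
    # Keep only the texts that are repeated.
    return Counter({text: n for text, n in counts.items() if n > 1})


# Hypothetical sample in the same shape as the file above (not the real data).
sample = {
    "rasa_nlu_data": {
        "common_examples": [
            {"text": "cheap", "intent": "inform", "entities": []},
            {"text": "cheap", "intent": "inform", "entities": []},
            {"text": "cheap", "intent": "inform", "entities": []},
            {"text": "expensive", "intent": "inform", "entities": []},
        ]
    }
}

duplicates = count_duplicate_texts(sample)
for text, n in duplicates.most_common():
    print(f"{n}x {text!r}")
```

To run it against a real file, load the file with `json.load` and pass the resulting dictionary to the function instead of `sample`.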

Thank you, any advice is appreciated.

deepshet · August 26, 2018, 10:10pm

No, you don’t need to repeat every example for every animal. But you do need a sufficient number of examples (i.e. test it out and see!). See also Chatito.

Not sure about the duplicates. My recollection is that this data was provided by someone else.

alpes · August 27, 2018, 5:28am

Hello, I have trained the model with different intents. The training_data.json loads with the expected intents and entities:

- intent examples: 10 (5 distinct intents)
- Found intents: 'budget', 'customers', 'revenue', 'Debt/EBITDAX', 'production'
- entity examples: 10 (4 distinct entities)
- found entities: 'verb', 'value', 'time', 'speed'

After that I tested the same sentences that I had included in training, but I am getting the wrong intents, and the entities come back as an empty list:

pprint(interpreter.parse("54% revenue growth in the last two quarters to $10m"))

Here I should get revenue as my intent, but it shows production:

{
  "intent": {
    "name": "production",
    "confidence": 0.5166328781001349
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "production",
      "confidence": 0.5166328781001349
    },
    {
      "name": "revenue",
      "confidence": 0.1783357881550128
    },
    {
      "name": "Debt/EBITDAX",
      "confidence": 0.14805304182717913
    },
    {
      "name": "customers",
      "confidence": 0.08582269154352269
    },
    {
      "name": "budget",
      "confidence": 0.0711556003741502
    }
  ],
  "text": "54% revenue growth in the last two quarters to $10m"
}

regards, alpes
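One way to read a parse result like the one above is to compare the top intent's confidence with the runner-up. The sketch below works only on the dictionary structure shown in the post; the helper name `top_two_margin` is hypothetical and not a Rasa function. Note that 10 examples across 5 intents averages two per intent, which may simply be too few for the classifier to separate them.

```python
def top_two_margin(parse_result):
    """Return (top intent name, top confidence, margin over the runner-up)."""
    ranking = sorted(
        parse_result["intent_ranking"],
        key=lambda item: item["confidence"],
        reverse=True,
    )
    top = ranking[0]
    runner_up_confidence = ranking[1]["confidence"] if len(ranking) > 1 else 0.0
    return top["name"], top["confidence"], top["confidence"] - runner_up_confidence


# Intent ranking copied from the output posted above.
result = {
    "intent_ranking": [
        {"name": "production", "confidence": 0.5166328781001349},
        {"name": "revenue", "confidence": 0.1783357881550128},
        {"name": "Debt/EBITDAX", "confidence": 0.14805304182717913},
        {"name": "customers", "confidence": 0.08582269154352269},
        {"name": "budget", "confidence": 0.0711556003741502},
    ]
}

name, confidence, margin = top_two_margin(result)
print(name, round(confidence, 3), round(margin, 3))
```

A small margin or a low top confidence is a hint that the intents overlap or that more varied examples are needed.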

akelad · August 27, 2018, 8:40am

As @deepshet said, you don’t need to provide every single example, but you do need to provide a variety of examples. The linked tool called Chatito should help you with this.
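To illustrate the general idea of generating examples rather than writing each one by hand, here is a minimal sketch that expands a few phrasing templates over a list of entity values and prints them in the markdown training format. This is not Chatito and does not use its syntax; the function name `expand_templates` and the template strings are assumptions made for illustration.

```python
from itertools import product


def expand_templates(intent, templates, entity, values):
    """Build markdown training data by filling each template with each value."""
    lines = [f"## intent:{intent}"]
    for template, value in product(templates, values):
        # Annotate the value in the markdown entity syntax: [value](entity)
        annotated = f"[{value}]({entity})"
        lines.append("- " + template.format(**{entity: annotated}))
    return "\n".join(lines)


# Hypothetical templates and values based on the animal example in this thread.
templates = ["{animal}", "buy {animal}", "I would like to buy a {animal}"]
animals = ["cat", "dog", "parrot"]

print(expand_templates("inform", templates, "animal", animals))
```

This only varies the entity value. The variety in phrasing still has to come from the templates themselves, so it is worth writing several genuinely different ones.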

As for duplicates, yeah this is the bAbI dataset converted into our format. The duplicates don’t change anything in accuracy.

liloup1789 · March 24, 2020, 3:04pm

Hello, could I see the converted bAbI dataset? I have the original (Facebook format) and a ParlAI-converted version, but I don’t see Rasa’s.
