New Training Data Format Ideas

degiz · June 2, 2020, 6:13pm

For some time we’ve been thinking about what should go into the next major release of Rasa Open Source. One item at the top of the list is a change to the training and configuration data format.

The current training data format has been with us for quite a while. Since the training data files are just normal text files and are maintained (written and read) by humans (not bots), Markdown is quite handy as a format. However, over time we’ve run into some limitations with it:

training stories use Markdown only nominally; in practice the format is a mix of special symbols and JSON, which makes it quite tricky to write
any new feature we’d like to add would bring even more special symbols

At the same time we received several important feature requests like:

support for optional custom metadata in training examples
ability to split possibly long configuration files into smaller separate ones
add buttons and images support for the response selector

We also want to unify the way we define entities and make it “future-proof”, so that it’s expandable enough for possible upcoming features.

Of course, the training data and configuration files should still stay human-readable and editable.

So we’ve started to discuss what we can do to address that and came up with two ideas that we’d like to share with you.

Let’s quickly recall how a typical Rasa Open Source project is organized now.

Training data files

NLU.md
Stories.md
Responses.md
Lookup-Table.md

Configuration files

domain.yml
config.yml
endpoints.yml
credentials.yml

Option #1: Extended Original Format

We generalize the existing “entity roles” syntax so that, in every intent example, an entity marked with [] (square brackets) must always be followed by valid JSON describing it:

Example

How much is the flight to [London]{"entity": "city", "role": "to"}
I'd like to order a [barbeque]{"entity": "dish", "synonym": "bbq"}
I do not know the name of the airport in [new york]{"entity": "city"}
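
To make the generalized syntax concrete, here is a minimal Python sketch of how such annotations could be read. It is not the Rasa parser: the function name `parse_example`, the added `start`/`end`/`text` keys and the restriction to JSON objects without nested braces are all assumptions made for illustration.

```python
import json
import re

# Matches "[entity text]{...json...}".
# Assumption: the JSON object contains no nested braces.
ANNOTATION = re.compile(r"\[(?P<text>[^\]]+)\](?P<json>\{[^}]*\})")


def parse_example(example):
    """Return the plain text and a list of entity dicts with character offsets."""
    plain_parts = []
    entities = []
    cursor = 0      # position in the annotated string
    plain_len = 0   # length of the plain text built so far
    for match in ANNOTATION.finditer(example):
        before = example[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        value = match.group("text")
        entity = json.loads(match.group("json"))
        # Hypothetical bookkeeping keys, not part of the proposed format.
        entity.update(start=plain_len, end=plain_len + len(value), text=value)
        entities.append(entity)
        plain_parts.append(value)
        plain_len += len(value)
        cursor = match.end()
    plain_parts.append(example[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_example(
    'How much is the flight to [London]{"entity": "city", "role": "to"}'
)
```

Because the part in braces is plain JSON, new keys could be added to the annotation later without inventing more special symbols.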

We leave what we already have in the Markdown, but we extend it with:

“rich responses” for ResponseSelector
metadata syntax with %
mandatory story: key for stories
lookup tables can be defined in a separate file without a reference from NLU.md

stories.md

<!-- note that the next line starts with `story:` key -->
## story: greet + location/price + cuisine + num people
% Metadata is defined using % (percentage)
* greet
   - action_ask_howcanhelp
* inform{"location": "...", "price": "..."}
   - action_on_it
   - action_ask_cuisine
* inform{"cuisine": "..."}
   - action_ask_numpeople
* inform{"people": "..."}
   - action_ack_dosearch

<!-- note that the next line starts with `story:` key -->
## story: first story
* greet
   - action_ask_user_question
> check_asked_question

<!-- note that the next line starts with `story:` key -->
## story: user affirms question
> check_asked_question
* affirm
  - action_handle_affirmation
> check_handled_affirmation

<!-- note that the next line starts with `story:` key -->
## story: story with OR
* affirm OR thankyou
  - action_handle_affirmation

## story: restaurant form
* request_restaurant
  - restaurant_form
  - form{"name": "restaurant_form"}
  - form{"name": null}

nlu.md

## intent: check_balance
% Metadata is defined using % (percentage)
- what is my balance <!-- no entity -->
- how much do I have on my [savings account]{"entity": "source_account", "value": "savings"} <!-- synonyms, method 1-->
% Example metadata => Sentiment: normal
- Could I pay in [yen]{"entity": "currency"}?  <!-- entity matched by lookup table -->

## intent: greet
- hey
- hello

## synonym: savings   <!-- synonyms, method 2 -->
- pink pig

## regex: zipcode
- [0-9]{5}

<!-- If a lookup table gets too large, it can be moved to a separate file (like any other component) -->
## lookup: additional_currencies
- Peso
- Euro
- Dollar

## intent: chitchat/ask_name
- what's your name
- who are you?
- what are you called?

## intent: chitchat/ask_weather
- how's the weather?
- is it sunny where you are?

## intent: faq/other_language
- Can you do another language?

We get rid of responses.md and the ResponseSelector responses go directly into the domain. The domain file can be split into multiple files. Below is an example of a domain file with only responses defined.

domain_responses.yml

responses:
  chitchat/ask_weather:
  - text: Where do you want to check the weather?
    buttons:
    - title: Current location
      payload: here
    - title: Other place
      payload: other_location
  faq/other_language:
  - text: I can only do English right now.
    image: "https://i.imgur.com/nGF1K8f.jpg"
  - text: I only speak English!

Option #2: YAML Format

As the title says, we go 100% YAML.

With this format you are free to distribute Domain, NLU and Core data among any number of YAML files. The Rasa parser will only read top-level keys (e.g. stories, nlu, slots, etc.) to understand which information it’s parsing.
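
As a rough illustration of what reading by top-level key could look like, here is a small Python sketch. It works on plain dicts standing in for already-parsed YAML files; the function name `merge_by_top_level_key` and the merge rules (lists are concatenated, mappings are updated) are assumptions, not a description of the actual Rasa loader.

```python
def merge_by_top_level_key(documents):
    """Combine several parsed files into one project, keyed by top-level section."""
    merged = {}
    for document in documents:
        for key, value in document.items():
            if isinstance(value, list):
                # Sections such as "stories" or "nlu" are lists: concatenate them.
                merged.setdefault(key, []).extend(value)
            elif isinstance(value, dict):
                # Sections such as "responses" are mappings: later files add entries.
                merged.setdefault(key, {}).update(value)
            else:
                merged[key] = value
    return merged


# Stand-ins for the result of parsing three separate YAML files.
stories_file = {"stories": [{"story": "story_with_or", "steps": []}]}
nlu_file = {"nlu": [{"intent": "deny", "examples": ["no", "no thanks"]}]}
mixed_file = {
    "nlu": [{"intent": "greet", "examples": ["hey", "hello"]}],
    "responses": {"faq/other_language": [{"text": "I only speak English!"}]},
}

project = merge_by_top_level_key([stories_file, nlu_file, mixed_file])
```

The file names play no role here: only the top-level keys decide where each piece of data ends up.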

We keep those configuration files that are already using YAML.

We also keep the “generalized” entities syntax from Option #1.

stories.yml

stories:
- story: SCENARIO CHECK
  metadata: This is metadata!
  steps:
  - user: /greet
  - action: utter_SCENARIOCHECK

- story: give_travel_plan CHECK
  steps:
  - user: /greet 
    entities: # This means that following entities were extracted from the utterance
    - context_scenario
    - holiday_name
  - slot: # This means that the slot has been set by this point in the conversation
      context_scenario: holiday
      holiday_name: thanksgiving
  - action: action_disclaimer
  - action: utter_holiday-travel_offer_help

- story: give_travel_plan CHECK
  steps:
  - user: /greet # User message
    entities:
    - context_scenario
    - holiday_name
  - slot:
      context_scenario: holiday
      holiday_name: thanksgiving
  - action: action_disclaimer  # Run actions
  - action: utter_holiday-travel_offer_help

- story: story_with_a_checkpoint_1
  steps:
  - user: /greet
  - action: utter_greet
  - checkpoint: greet_checkpoint

- story: story_with_a_checkpoint_2
  steps:
  - checkpoint: greet_checkpoint
  - user: /book_flight
  - action: action_book_flight

- story: story_with_or
  steps:
  - user: /book_flight OR /book_train
  - action: action_ask_details

# End to end testing format:
e2e_tests:
- story: A basic end-to-end test
  steps:
  - user: /greet hello  # Same as regular user message, with text added to the end
  - action: utter_ask_howcanhelp
  - user: /inform show me [chinese](cuisine) restaurants
  - action: utter_ask_location
  - user: /inform in [Paris](location)
  - action: utter_ask_price

nlu.yml

nlu:
- intent: estimate_emissions
  metadata:
    author: Some example metadata on intent level!
  examples:
  - "how much CO2 will that use?"
  - 'how much carbon will a one way flight from [new york]{"entity": "city", "role": "from"} to california produce?' # Need quotes to include ':' in a string

- intent: buy_offsets
  examples:
  - text: 'I want to buy offsets'
    metadata: Example metadata
  - text: 'I want buy offset'

- intent: inform
  examples:
  - '[NEW YORK]{"entity": "city"}'
  - 'I said I''m going to [Boston]{"entity": "city", "role": "to"}' # Single quotes are escaped as ''

- intent: deny
  examples:
  - 'no' # Some values will need to be quoted or else they'll be parsed as booleans. We will issue warnings to help users avoid this.
  - 'false'
  - 'no thanks'

- synonym: savings
  examples:
  - 'pink pig'
  - 'savings account'

- regex: zipcode
  examples:
  - '[0-9]{5}'

- lookup: additional_currencies # Lookup tables are inline, their elements are included in the same file
  examples:
  - 'Peso'
  - 'Euro'
  - 'Dollar' # If a lookup table gets too large, it can be moved to a separate file (like any other component)
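
The quoting issue mentioned in the deny intent above comes from YAML 1.1, where unquoted words such as no or false are booleans rather than strings. The following Python sketch shows one way a warning could be produced. The function names are hypothetical, and the token list follows the YAML 1.1 bool type; real parsers differ in how many of these tokens they honour, so it should be read as a superset.

```python
# Plain scalars that the YAML 1.1 bool type resolves to booleans.
# Assumption: treated as a superset, since parsers differ in what they accept.
YAML_11_BOOLEANS = {
    "y", "Y", "yes", "Yes", "YES", "n", "N", "no", "No", "NO",
    "true", "True", "TRUE", "false", "False", "FALSE",
    "on", "On", "ON", "off", "Off", "OFF",
}


def needs_quotes(example):
    """Return True if an unquoted example could be read as a boolean."""
    return example.strip() in YAML_11_BOOLEANS


def examples_to_warn_about(examples):
    """Return the examples that should be quoted in the YAML file."""
    return [example for example in examples if needs_quotes(example)]


risky = examples_to_warn_about(["no", "false", "no thanks"])
```

Only whole-value matches are flagged, so an example like no thanks stays a string even without quotes.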

We get rid of responses.md and the ResponseSelector responses go directly into the domain. The domain file can be split into multiple files. Below is an example of a domain file with only responses defined.

domain_responses.yml

responses:
  chitchat/ask_weather:
  - text: Where do you want to check the weather?
    buttons:
    - title: Current location
      payload: here
    - title: Other place
      payload: other_location
  faq/other_language:
  - text: I can only do English right now.
    image: "https://i.imgur.com/nGF1K8f.jpg"
  - text: I only speak English!

We would like to ask for your feedback on the two formats above: which one would you prefer, and why?

Let’s keep the discussion going in the comments below this post!

Titus · June 2, 2020, 6:23pm

Think about it ‘high level’: import a CSV file with two columns: questions & answers => FAQ chatbot. Make it so that it needs no effort to be set up. (Don’t get me wrong: I love Rasa)

staticdev · June 2, 2020, 6:46pm

I would go for option #2. YAML is a widespread configuration format. I don’t see Markdown as adequate, since it was designed as a documentation format. Also, a mix of Markdown and JSON makes everything more complex, and I don’t see how things would evolve and get better that way (as the future-proof format you intend).

saurabh-m523 · June 2, 2020, 6:46pm

I would personally favor going full YAML.

This is mainly because

This format seems easier to understand for someone who is just getting started with Rasa: it does not need many special symbols, because the keys say what is meant.
This could also help if more features are added to the training format, since we would not need too many special symbols for them (or so I think).

But (in case you adopt full YAML) would it be backward compatible?

samscudder · June 2, 2020, 7:58pm

Full YAML

TrueSon · June 2, 2020, 8:51pm

At a quick glance, option 2 is my preferred option. It seems cleaner and easier to understand.

niveK · June 2, 2020, 8:52pm

Full YAML would be great! It would be a self-documenting training format, which I think is its biggest strength, in addition to unifying the language used across the board.

As much as I love Markdown, having to write an assortment of regexes for finding intent/entity combinations, story types etc. is definitely a pain. If we had unified, structured training data we could also glean more information for a given story that we can then aggregate to sample from. Things like finding all stories with only a single turn, stories that have repeating patterns, are all useful to be able to check, once your training data becomes large enough.
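
A minimal Python sketch of that kind of check, assuming the stories have already been parsed into lists of dicts shaped like the stories.yml proposal (the helper names and the `two_turns` story are made up for illustration):

```python
def count_user_turns(story):
    """Count the steps of a story that are user messages."""
    return sum(1 for step in story["steps"] if "user" in step)


def single_turn_stories(stories):
    """Return the names of stories with exactly one user message."""
    return [story["story"] for story in stories if count_user_turns(story) == 1]


stories = [
    {"story": "story_with_or", "steps": [
        {"user": "/book_flight OR /book_train"},
        {"action": "action_ask_details"},
    ]},
    # Hypothetical two-turn story, added only to have something to filter out.
    {"story": "two_turns", "steps": [
        {"user": "/greet"},
        {"action": "utter_greet"},
        {"user": "/book_flight"},
        {"action": "action_book_flight"},
    ]},
]

names = single_turn_stories(stories)
```

With structured data this is a dictionary lookup per step; with the Markdown format the same check needs a regex for each kind of line.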

Also a quick note on lookup table format, @degiz – it’s supposed to be a .txt with terms delimited by newlines, so there’s yet another format to manage. A YAML list would be great to use for those!

arun_singh · June 3, 2020, 2:21am

YAML would be better.

IgNoRaNt23 · June 3, 2020, 6:55am

Afaik it’s not currently possible to distribute the data from the domain across multiple files (even though it’s already YAML). That’s a feature I’d like to see.

fede · June 3, 2020, 8:06am

@saurabh-m523

This could also help if more features are added to the training format, since we would not need too many special symbols for them (or so I think).

Yep! That’s one of the advantages of the YAML format.

But (in case you adopt full YAML) would it be backward compatible?

Ideally we would have some sort of mechanism that allows making the transition smoother. Either supporting both new and old formats at the same time, or creating a tool that migrates data to the new format.

@niveK

As much as I love Markdown, having to write an assortment of regexes for finding intent/entity combinations, story types etc. is definitely a pain. If we had unified, structured training data we could also glean more information for a given story that we can then aggregate to sample from. Things like finding all stories with only a single turn, stories that have repeating patterns, are all useful to be able to check, once your training data becomes large enough.

That’s an interesting point! Just to be sure I understood correctly, are you saying that by having a YAML format you would be able to use standard YAML tooling to parse/process your data easily?

Also a quick note on lookup table format, degiz – it’s supposed to be a .txt with terms delimited by newlines, so there’s yet another format to manage. A YAML list would be great to use for those!

For both formats (original extended and YAML) we decided to change lookup tables so that their elements are included directly in the training data file. In the case of YAML, it’s just a list of strings (see bottom of the nlu.yml example).

Paras · June 3, 2020, 8:18am

Option 2: YAML Format

randomsven · June 3, 2020, 9:46am

+1 for YAML: A structured, well-defined format, like YAML would most definitely be preferred - you can generate MD for readability from YAML, more difficult to go the other way. Don’t invent new formats (mix of MD and JSON) if you can use established and battle-tested formats and associated tool ecosystem.
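
To show why the YAML-to-Markdown direction is the easy one, here is a short Python sketch that renders parsed NLU items in the Markdown layout from Option #1. It assumes items shaped like the nlu.yml proposal; the function name `nlu_to_markdown` is hypothetical and metadata is simply dropped.

```python
def nlu_to_markdown(nlu_items):
    """Render parsed NLU items as Markdown sections."""
    sections = []
    for item in nlu_items:
        # Each item has one "kind" key (intent, synonym, regex, lookup) plus examples.
        kind = next(key for key in item if key not in ("examples", "metadata"))
        lines = [f"## {kind}: {item[kind]}"]
        for example in item["examples"]:
            # Examples are either plain strings or dicts with a "text" key.
            text = example["text"] if isinstance(example, dict) else example
            lines.append(f"- {text}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"


nlu_items = [
    {"intent": "greet", "examples": ["hey", "hello"]},
    {"synonym": "savings", "examples": ["pink pig", "savings account"]},
]

markdown = nlu_to_markdown(nlu_items)
```

Going the other way means parsing headings, bullets and inline annotations by hand, which is where the difficulty lies.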

KhalidBentaleb · June 3, 2020, 10:29am

I would go for option #2. It seems cleaner and easier to understand.

GeovanaRamos · June 3, 2020, 11:43am

Option #2 looks great, as long as we have a smooth transition and backwards compatibility.

ezhvsalate · June 3, 2020, 1:57pm

Like the second option with YAML. It will be much easier to work with it (parse / export) using any external UI.

alfredfrancis · June 3, 2020, 2:23pm

YAML

jamesmf · June 3, 2020, 2:27pm

What would this mean for json formatted files?

tomp · June 3, 2020, 2:48pm

I never quite understood why Rasa was putting machine readable data in markdown format. I figured it was a way to appeal to non-developers because it is, in theory, more human-readable. … Just not for this data. The markdown format is a mismatch. Rasa bot data is not what markdown was intended to encode.

All-Yaml makes more sense.

But I would go one step further and allow most data to be defined in JSON. I mean, why not allow a given file to be encoded in more than one format? Any standard text-data file format would work because all text data formats are also intended to be human-readable.

Eventually, Rasa may need to develop their own file format that is streamlined for the Rasa-specific use-cases.

Allowing split files is essential. That should be extended to credential files, with one credential file being committed to source code repo that does not contain any personal credential information but which can be used as a template for those who want to clone the repo and start their own. And a second credentials file that overrides the first at runtime and is only edited and saved locally because it contains sensitive information.
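
A minimal Python sketch of the override idea, working on plain dicts that stand in for the two parsed credentials files. The function name `overlay`, the channel name and the field names are all hypothetical; nothing here reflects how Rasa actually loads credentials.

```python
import copy


def overlay(template, local):
    """Return the template with values from local layered on top, recursing into dicts."""
    result = copy.deepcopy(template)
    for key, value in local.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = overlay(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical channel and field names, for illustration only.
committed_template = {
    "example_channel": {
        "url": "https://example.com/hook",
        "token": "<your token here>",
    }
}
local_secrets = {"example_channel": {"token": "s3cret"}}

credentials = overlay(committed_template, local_secrets)
```

The committed file keeps its placeholder values, and only the merged result in memory ever contains the secret.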

In the end, I would be happy with a system that allowed text files to be both human- and machine-readable, and in which the code that updates a human-readable file does not destroy the order of elements or the whitespace formatting that the human established.

yashim77 · June 3, 2020, 4:57pm

Option #2 YAML

martin · June 3, 2020, 6:21pm

Option #2 looks much better than option #1

The possibility to split all files (domain, nlu and stories) would be fantastic!
