New Training Data Format Ideas

Hey Rasa @community!

For some time we’ve been thinking about what should go into the next major release of Rasa Open Source. One thing at the top of the list was a change to the training and configuration data format.

The current training data format has been with us for quite a while. Since the training data files are plain text files that are written and read by humans (not bots), Markdown is quite a handy format. However, at some point we started to run into limitations with it:

  • training stories are Markdown in form only; in fact they are a mix of special symbols and JSON, which makes them quite tricky to write
  • any new feature we’d like to add would bring even more special symbols

At the same time we received several important feature requests like:

  • support for optional custom metadata in training examples
  • ability to split possibly long configuration files into smaller separate ones
  • button and image support for the response selector

We also want to unify the way we define entities and make it “future-proof”, so that it’s expandable enough for possible upcoming features.

Of course, the training data and configuration files should still stay human-readable and editable.

So we’ve started to discuss what we can do to address that and came up with two ideas that we’d like to share with you.

Let’s quickly recall how a typical Rasa Open Source project is organized now.

Training data files
  • NLU.md
  • Stories.md
  • Responses.md
  • Lookup-Table.md

Configuration files
  • domain.yml
  • config.yml
  • endpoints.yml
  • credentials.yml

Option #1: Extended Original Format

We generalize the existing “entity roles” syntax so that, in every intent example, an entity marked with [] (square brackets) must always be followed by a valid JSON object describing it:

Example
How much is the flight to [London]{"entity": "city", "role": "to"}
I'd like to order a [barbeque]{"entity": "dish", "synonym": "bbq"}
I do not know the name of the airport in [new york]{"entity": "city"}
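
To make the syntax concrete, here is a minimal sketch of how such annotations could be read. It is not Rasa's actual parser: the function name `parse_example` and the regular expression are assumptions made for illustration, and the pattern assumes the JSON object contains no nested braces.

```python
import json
import re

# Matches "[entity text]{...json...}".
# Assumption: the JSON object has no nested braces.
ANNOTATION = re.compile(r"\[(?P<text>[^\]]+)\](?P<meta>\{[^}]*\})")


def parse_example(example):
    """Return (plain_text, entities) for one annotated training example."""
    entities = []
    plain_parts = []
    cursor = 0      # position in the annotated string
    plain_len = 0   # length of the plain text built so far
    for match in ANNOTATION.finditer(example):
        before = example[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)

        text = match.group("text")
        entity = json.loads(match.group("meta"))
        # Character offsets refer to the plain text, not the annotated one.
        entity["start"] = plain_len
        entity["end"] = plain_len + len(text)
        entity["text"] = text
        entities.append(entity)

        plain_parts.append(text)
        plain_len += len(text)
        cursor = match.end()
    plain_parts.append(example[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_example(
    'How much is the flight to [London]{"entity": "city", "role": "to"}'
)
print(text)
print(entities)
```

Because the description is plain JSON, new keys could be added later without new special symbols, which is the “future-proof” property mentioned above.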

We leave what we already have in the Markdown, but we extend it with:

  • “rich responses” for ResponseSelector
  • metadata syntax with %
  • mandatory story: key for stories
  • lookup tables can be defined in a separate file without a reference from NLU.md

stories.md
<!-- note that the next line starts with `story:` key -->
## story: greet + location/price + cuisine + num people
% Metadata is defined using % (percentage)
* greet
   - action_ask_howcanhelp
* inform{"location": "...", "price": "..."}
   - action_on_it
   - action_ask_cuisine
* inform{"cuisine": "..."}
   - action_ask_numpeople
* inform{"people": "..."}
   - action_ack_dosearch

<!-- note that the next line starts with `story:` key -->
## story: first story
* greet
   - action_ask_user_question
> check_asked_question

<!-- note that the next line starts with `story:` key -->
## story: user affirms question
> check_asked_question
* affirm
  - action_handle_affirmation
> check_handled_affirmation

<!-- note that the next line starts with `story:` key -->
## story: story with OR
* affirm OR thankyou
  - action_handle_affirmation

## story: restaurant form
* request_restaurant
  - restaurant_form
  - form{"name": "restaurant_form"}
  - form{"name": null}

nlu.md
## intent: check_balance
% Metadata is defined using % (percentage)
- what is my balance <!-- no entity -->
- how much do I have on my [savings account]{"entity": "source_account", "value": "savings"} <!-- synonyms, method 1-->
% Example metadata => Sentiment: normal
- Could I pay in [yen]{"entity": "currency"}?  <!-- entity matched by lookup table -->

## intent: greet
- hey
- hello

## synonym: savings   <!-- synonyms, method 2 -->
- pink pig

## regex: zipcode
- [0-9]{5}

<!-- If a lookup table gets too large, it can be moved to a separate file (like any other component) -->
## lookup: additional_currencies
- Peso
- Euro
- Dollar 

## intent: chitchat/ask_name
- what's your name
- who are you?
- what are you called?

## intent: chitchat/ask_weather
- how's the weather?
- is it sunny where you are?

## intent: faq/other_language
- Can you do another language?

We get rid of responses.md and the ResponseSelector responses go directly into the domain. The domain file can be split into multiple files. Below is an example of a domain file with only responses defined.

domain_responses.yml
responses:
  chitchat/ask_weather:
  - text: Where do you want to check the weather?
    buttons:
    - title: Current location
      payload: here
    - title: Other place
      payload: other_location
  faq/other_language:
  - text: I can only do English right now.
    image: "https://i.imgur.com/nGF1K8f.jpg"
  - text: I only speak English!

Option #2: YAML Format

As the heading says, we go 100% YAML.

With this format you are free to distribute Domain, NLU and Core data among any number of YAML files. The Rasa parser will only read top-level keys (e.g. stories, nlu, slots, etc.) to understand which information it’s parsing.
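
As a rough illustration of what “reading only top-level keys” could look like, the sketch below merges several already-parsed YAML documents (plain dicts) by section. The function name `merge_documents` and the `KNOWN_SECTIONS` set are assumptions for this example, not part of Rasa; loading the files with a YAML library is left out.

```python
from collections import defaultdict

# Assumption: section names taken from the examples in this post.
KNOWN_SECTIONS = {"stories", "nlu", "slots", "responses", "e2e_tests"}


def merge_documents(documents):
    """Merge parsed YAML documents (plain dicts) by their top-level keys."""
    list_sections = defaultdict(list)
    dict_sections = defaultdict(dict)
    for doc in documents:
        for key, value in doc.items():
            if key not in KNOWN_SECTIONS:
                raise ValueError(f"unknown top-level key: {key!r}")
            if isinstance(value, list):
                # e.g. stories and nlu items are concatenated
                list_sections[key].extend(value)
            elif isinstance(value, dict):
                # e.g. responses and slots are merged by name
                dict_sections[key].update(value)
            else:
                raise TypeError(f"section {key!r} must be a list or mapping")
    return {**list_sections, **dict_sections}


files = [
    {"nlu": [{"intent": "greet", "examples": ["hey", "hello"]}]},
    {"nlu": [{"intent": "deny", "examples": ["no thanks"]}],
     "stories": [{"story": "story_with_or", "steps": []}]},
    {"responses": {"faq/other_language": [{"text": "I only speak English!"}]}},
]
merged = merge_documents(files)
print(sorted(merged))
```

With this approach the file names carry no meaning; only the keys inside each file decide where the data ends up.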

We keep those configuration files that are already using YAML.

We also keep the “generalized” entities syntax from Option #1.

stories.yml
stories:
- story: SCENARIO CHECK
  metadata: This is metadata!
  steps:
  - user: /greet
  - action: utter_SCENARIOCHECK

- story: give_travel_plan CHECK
  steps:
  - user: /greet 
    entities: # This means that the following entities were extracted from the utterance
    - context_scenario
    - holiday_name
  - slot: # This means that the slot has been set by this point in the conversation
      context_scenario: holiday
      holiday_name: thanksgiving
  - action: action_disclaimer
  - action: utter_holiday-travel_offer_help

- story: give_travel_plan CHECK
  steps:
  - user: /greet # User message
    entities:
    - context_scenario
    - holiday_name
  - slot:
      context_scenario: holiday
      holiday_name: thanksgiving
  - action: action_disclaimer  # Run actions
  - action: utter_holiday-travel_offer_help

- story: story_with_a_checkpoint_1
  steps:
  - user: /greet
  - action: utter_greet
  - checkpoint: greet_checkpoint

- story: story_with_a_checkpoint_2
  steps:
  - checkpoint: greet_checkpoint
  - user: /book_flight
  - action: action_book_flight

- story: story_with_or
  steps:
  - user: /book_flight OR /book_train
  - action: action_ask_details

# End to end testing format:
e2e_tests:
- story: A basic end-to-end test
  steps:
  - user: /greet hello  # Same as regular user message, with text added to the end
  - action: utter_ask_howcanhelp
  - user: /inform show me [chinese](cuisine) restaurants
  - action: utter_ask_location
  - user: /inform in [Paris](location)
  - action: utter_ask_price

nlu.yml
nlu:
- intent: estimate_emissions
  metadata:
    author: Some example metadata on intent level!
  examples:
  - "how much CO2 will that use?"
  - 'how much carbon will a one way flight from [new york]{"entity": "city", "role": "from"} to california produce?' # Need quotes to include ':' in a string

- intent: buy_offsets
  examples:
  - text: 'I want to buy offsets'
    metadata: Example metadata
  - text: 'I want buy offset'

- intent: inform
  examples:
  - '[NEW YORK]{"entity": "city"}'
  - 'I said I''m going to [Boston]{"entity": "city", "role": "to"}' # Single quotes are escaped as ''

- intent: deny
  examples:
  - 'no' # Some values will need to be quoted or else they'll be parsed as booleans. We will issue warnings to help users avoid this.
  - 'false'
  - 'no thanks'

- synonym: savings
  examples:
  - 'pink pig'
  - 'savings account'

- regex: zipcode
  examples:
  - '[0-9]{5}'

- lookup: additional_currencies # Lookup tables are inline; their elements are included in the same file
  examples:
  - 'Peso'
  - 'Euro'
  - 'Dollar' # If a lookup table gets too large, it can be moved to a separate file (like any other component)
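
The quoting pitfall in the `deny` intent can be checked mechanically. YAML 1.1 loaders such as PyYAML read an unquoted `no` or `false` as a boolean, so after loading, such examples are no longer strings. The sketch below works on already-parsed data and reports those cases; `find_unquoted_examples` is a hypothetical helper, not an existing Rasa function.

```python
def find_unquoted_examples(nlu_section):
    """Return (name, value) pairs for examples that did not parse as strings."""
    problems = []
    for item in nlu_section:
        # The item name lives under one of these keys in the proposed format.
        name = (item.get("intent") or item.get("synonym")
                or item.get("regex") or item.get("lookup"))
        for example in item.get("examples", []):
            # Examples are either plain strings or mappings with a "text" key.
            text = example.get("text") if isinstance(example, dict) else example
            if not isinstance(text, str):
                problems.append((name, text))
    return problems


# What a YAML 1.1 loader would produce if `no` and `false` were left unquoted:
parsed_nlu = [{"intent": "deny", "examples": [False, False, "no thanks"]}]
for name, value in find_unquoted_examples(parsed_nlu):
    print(f"warning: example {value!r} under '{name}' is not a string; quote it")
```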

As in Option #1, we get rid of responses.md and the ResponseSelector responses go directly into the domain. The domain file can be split into multiple files. Below is the same example of a domain file with only responses defined.

domain_responses.yml
responses:
  chitchat/ask_weather:
  - text: Where do you want to check the weather?
    buttons:
    - title: Current location
      payload: here
    - title: Other place
      payload: other_location
  faq/other_language:
  - text: I can only do English right now.
    image: "https://i.imgur.com/nGF1K8f.jpg"
  - text: I only speak English!

We would like to ask for your feedback on the two formats above: which one would you prefer, and why?

Let’s keep the discussion going in the comments below this post!

Think about it ‘high level’: import a CSV file with two columns: questions & answers => FAQ chatbot. Make it so that it needs no effort to be set up. (Don’t get me wrong: I love Rasa)
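
One way to picture this against the Option #2 format: a two-column CSV could be turned into `nlu` and `responses` sections with a few lines of Python. This is only a sketch; the column names `question` and `answer` and the `faq/question_N` intent naming are assumptions made up for the example.

```python
import csv
import io


def faq_from_csv(csv_text):
    """Build `nlu` and `responses` sections from question/answer rows."""
    nlu = []
    responses = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        intent = f"faq/question_{index}"  # hypothetical naming scheme
        nlu.append({"intent": intent, "examples": [row["question"]]})
        responses[intent] = [{"text": row["answer"]}]
    return {"nlu": nlu, "responses": responses}


sample = (
    "question,answer\n"
    "Can you do another language?,I can only do English right now.\n"
)
print(faq_from_csv(sample))
```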

I would go for option #2. YAML is a widespread configuration format. I don’t see Markdown as adequate, since it is meant as a documentation format. Also, a mix of Markdown and JSON makes everything more complex, and I don’t see how things would evolve and get better this way (as the future-proof format you intend).

I would personally favor going full YAML.

This is mainly because

  • This format seems easier to understand for someone who is just getting started with Rasa, because it does not require many special symbols: the keys define what we mean.
  • This could also help if more features are added to the training format, since we would not need many more special symbols for them. (or so I think :sweat_smile:)

But (in case you guys adopt full YAML), would it be backward compatible?

Full YAML

At a quick glance, option 2 is my preferred option. It seems cleaner and easier to understand.

Full YAML would be great! It would be a self-documenting training format, which I think is its biggest strength, in addition to unifying the language used across the board.

As much as I love Markdown, having to write an assortment of regexes for finding intent/entity combinations, story types etc. is definitely a pain. If we had unified, structured training data we could also glean more information for a given story that we can then aggregate to sample from. Things like finding all stories with only a single turn, stories that have repeating patterns, are all useful to be able to check, once your training data becomes large enough.
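
For instance, with stories in the Option #2 structure, the “single turn” check needs no regexes at all. A minimal sketch, assuming the stories have already been loaded into Python dicts (the helper name `single_turn_stories` is made up here):

```python
def single_turn_stories(stories):
    """Return the names of stories that contain exactly one user step."""
    names = []
    for story in stories:
        user_turns = [step for step in story["steps"] if "user" in step]
        if len(user_turns) == 1:
            names.append(story["story"])
    return names


stories = [
    {"story": "SCENARIO CHECK",
     "steps": [{"user": "/greet"}, {"action": "utter_SCENARIOCHECK"}]},
    {"story": "two turns",
     "steps": [{"user": "/greet"}, {"action": "utter_greet"},
               {"user": "/book_flight"}, {"action": "action_book_flight"}]},
]
print(single_turn_stories(stories))
```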

Also, a quick note on the lookup table format, @degiz: it’s supposed to be a .txt file with terms delimited by newlines, so there’s yet another format to manage. A YAML list would be great to use for those!

YAML would be better.

AFAIK it’s not currently possible to distribute the domain data across multiple files (even though it’s already YAML). That’s a feature I’d like to see.

@saurabh-m523

  • This could also help if more features are added to the training format, since we would not need many more special symbols for them. (or so I think :sweat_smile:)

Yep! That’s one of the advantages of the YAML format.

But (in case you guys adopt full YAML), would it be backward compatible?

Ideally we would have some sort of mechanism that allows making the transition smoother. Either supporting both new and old formats at the same time, or creating a tool that migrates data to the new format.
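
To give a feel for the second route, here is a rough sketch of what such a migration could do for stories: it reads the Option #1 Markdown and emits the Option #2 structure as Python dicts. It is not an existing Rasa tool. The function name `convert_stories` is hypothetical, metadata lines and comments are skipped, and events such as `form{...}` are ignored because the YAML examples above do not show a counterpart for them.

```python
import json
import re

# "* intent" optionally followed by a JSON object with entity values.
INTENT = re.compile(r"^\*\s*(?P<intent>[^{]+?)\s*(?P<entities>\{.*\})?\s*$")
HEADER = "## story:"


def convert_stories(markdown):
    """Convert Option #1 Markdown stories into Option #2 style dicts."""
    stories = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line or line.startswith("<!--") or line.startswith("%"):
            continue  # blank lines, comments and metadata are skipped here
        if line.startswith(HEADER):
            stories.append({"story": line[len(HEADER):].strip(), "steps": []})
            continue
        if not stories:
            raise ValueError(f"step before any story header: {line!r}")
        steps = stories[-1]["steps"]
        if line.startswith(">"):
            steps.append({"checkpoint": line[1:].strip()})
        elif line.startswith("*"):
            match = INTENT.match(line)
            if match is None:
                raise ValueError(f"cannot parse user step: {line!r}")
            # "a OR b" becomes "/a OR /b", as in the story_with_or example
            names = match.group("intent").split(" OR ")
            step = {"user": " OR ".join("/" + name.strip() for name in names)}
            if match.group("entities"):
                # keep only the entity names, as in the YAML examples
                step["entities"] = list(json.loads(match.group("entities")))
            steps.append(step)
        elif line.startswith("-"):
            action = line[1:].strip()
            if "{" in action:
                continue  # events such as form{...} are out of scope here
            steps.append({"action": action})
    return stories


markdown = """## story: first story
* greet
   - action_ask_user_question
> check_asked_question
"""
print(convert_stories(markdown))
```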

@niveK

As much as I love Markdown, having to write an assortment of regexes for finding intent/entity combinations, story types etc. is definitely a pain. If we had unified, structured training data we could also glean more information for a given story that we can then aggregate to sample from. Things like finding all stories with only a single turn, stories that have repeating patterns, are all useful to be able to check, once your training data becomes large enough.

That’s an interesting point! Just to be sure I understood correctly, are you saying that by having a YAML format you would be able to use standard YAML tooling to parse/process your data easily?

Also a quick note on lookup table format, degiz – it’s supposed to be a .txt with terms delimited by newlines, so there’s yet another format to manage. A YAML list would be great to use for those!

For both formats (original extended and YAML) we decided to change lookup tables so that their elements are included directly in the training data file. In the case of YAML, it’s just a list of strings (see bottom of the nlu.yml example).

Option 2: YAML Format

+1 for YAML: a structured, well-defined format like YAML would most definitely be preferred. You can generate MD for readability from YAML; it is more difficult to go the other way. Don’t invent new formats (a mix of MD and JSON) if you can use established, battle-tested formats and their associated tool ecosystem.

I would go for option #2. It seems cleaner and easier to understand.

Option #2 looks great, as long as we have a smooth transition and backwards compatibility.

I like the second option with YAML. It will be much easier to work with (parse/export) from any external UI.

YAML :star_struck:

What would this mean for json formatted files?

I never quite understood why Rasa was putting machine readable data in markdown format. I figured it was a way to appeal to non-developers because it is, in theory, more human-readable. … Just not for this data. The markdown format is a mismatch. Rasa bot data is not what markdown was intended to encode.

All-Yaml makes more sense.

But I would go one step further and allow most data to be defined in JSON. I mean, why not allow a given file to be encoded in more than one format? Any standard text-data file format would work because all text data formats are also intended to be human-readable.

Eventually, Rasa may need to develop their own file format that is streamlined for the Rasa-specific use-cases.

Allowing split files is essential. That should be extended to credential files, with one credential file being committed to source code repo that does not contain any personal credential information but which can be used as a template for those who want to clone the repo and start their own. And a second credentials file that overrides the first at runtime and is only edited and saved locally because it contains sensitive information.
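
A sketch of that override idea, working on the two files after they have been parsed into dicts. The helper `layered_credentials` and the channel and key names in the example are invented for illustration; this is not how Rasa currently loads credentials.

```python
def layered_credentials(template, local_overrides):
    """Overlay local values on a committed template, one channel at a time."""
    # Copy the template so the committed defaults are never mutated.
    merged = {channel: dict(settings or {})
              for channel, settings in template.items()}
    for channel, settings in local_overrides.items():
        merged.setdefault(channel, {}).update(settings or {})
    return merged


# Committed to the repository: placeholders only (hypothetical names).
template = {"example_channel": {"token": "<your token here>", "room": "general"}}
# Kept locally and never committed: the real secret.
local = {"example_channel": {"token": "s3cr3t"}}

print(layered_credentials(template, local))
```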

In the end, I would be happy with a system that allowed text files to be both human- and machine-readable, and the code that updates a human-readable file should not destroy the order of elements or the whitespace formatting that the human established.

Option #2 YAML

Option #2 looks much better than option #1

The possibility to split all files (domain, nlu and stories) would be fantastic! :+1: