Hey Rasa @community!
For some time we’ve been thinking about what should go into the next major release of Rasa Open Source. One item at the top of the list is a change to the training data and configuration format.
The current training data format has been with us for quite a while. Since the training data files are plain text files that are written and read by humans (not bots), Markdown is a handy format. However, we have started to run into some limitations with it:
- training stories are Markdown in name only; in practice they are a mix of special symbols and JSON, which makes them tricky to write
- any new feature we’d like to add would bring even more special symbols
At the same time we received several important feature requests like:
- support for optional custom metadata in training examples
- the ability to split long configuration files into smaller, separate ones
- support for buttons and images in the response selector
We also want to unify the way we define entities and make it “future-proof”, so that it’s extensible enough for upcoming features.
Of course, the training data and configuration files should still stay human-readable and editable.
So we’ve started to discuss what we can do to address that and came up with two ideas that we’d like to share with you.
Let’s quickly recall how a typical Rasa Open Source project is organized now.
Training data files
- NLU.md
- Stories.md
- Responses.md
- Lookup-Table.md
Configuration files
- domain.yml
- config.yml
- endpoints.yml
- credentials.yml
Option #1: Extended Original Format
We generalize the existing “entity roles” syntax so that, in every intent example, an entity marked with [] (square brackets) must always be followed by valid JSON describing it:
Example
How much is the flight to [London]{"entity": "city", "role": "to"}
I'd like to order a [barbeque]{"entity": "dish", "synonym": "bbq"}
I do not know the name of the airport in [new york]{"entity": "city"}
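To make the generalized syntax concrete, here is a minimal Python sketch of how an annotated example could be split into plain text and entity descriptions. It is not Rasa's actual parser: the function name `parse_example` and the output field names `start`, `end` and `text` are made up for illustration, and the regular expression assumes the JSON object contains no nested braces.

```python
import json
import re

# Matches "[entity text]{...json...}". Assumption: no nested braces in the JSON.
ANNOTATION = re.compile(r"\[([^\]]+)\](\{[^{}]*\})")


def parse_example(example):
    """Return the plain text and a list of entity dicts with character offsets."""
    entities = []
    plain_parts = []
    cursor = 0  # position in the annotated string
    offset = 0  # length of the plain text built so far
    for match in ANNOTATION.finditer(example):
        before = example[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        entity_text = match.group(1)
        description = json.loads(match.group(2))  # fails loudly on invalid JSON
        entities.append({
            "start": offset,
            "end": offset + len(entity_text),
            "text": entity_text,
            **description,
        })
        plain_parts.append(entity_text)
        offset += len(entity_text)
        cursor = match.end()
    plain_parts.append(example[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_example(
    'How much is the flight to [London]{"entity": "city", "role": "to"}'
)
print(text)
print(entities)
```

Because every annotation carries a JSON object, new keys can be added later without inventing new special symbols, which is the point of the proposal.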
We leave what we already have in the Markdown, but we extend it with:
- “rich responses” for ResponseSelector
- metadata syntax with %
- mandatory story: key for stories
- lookup tables can be defined in a separate file without a reference from NLU.md
stories.md
<!-- note that the next line starts with `story:` key -->
## story: greet + location/price + cuisine + num people
% Metadata is defined using % (percentage)
* greet
- action_ask_howcanhelp
* inform{"location": "...", "price": "..."}
- action_on_it
- action_ask_cuisine
* inform{"cuisine": "..."}
- action_ask_numpeople
* inform{"people": "..."}
- action_ack_dosearch
<!-- note that the next line starts with `story:` key -->
## story: first story
* greet
- action_ask_user_question
> check_asked_question
<!-- note that the next line starts with `story:` key -->
## story: user affirms question
> check_asked_question
* affirm
- action_handle_affirmation
> check_handled_affirmation
<!-- note that the next line starts with `story:` key -->
## story: story with OR
* affirm OR thankyou
- action_handle_affirmation
## story: restaurant form
* request_restaurant
- restaurant_form
- form{"name": "restaurant_form"}
- form{"name": null}
nlu.md
## intent: check_balance
% Metadata is defined using % (percentage)
- what is my balance <!-- no entity -->
- how much do I have on my [savings account]{"entity": "source_account", "value": "savings"} <!-- synonyms, method 1-->
% Example metadata => Sentiment: normal
- Could I pay in [yen]{"entity": "currency"}? <!-- entity matched by lookup table -->
## intent: greet
- hey
- hello
## synonym: savings <!-- synonyms, method 2 -->
- pink pig
## regex: zipcode
- [0-9]{5}
<!-- If a lookup table gets too large, it can be moved to a separate file (like any other component) -->
## lookup:additional_currencies
- Peso
- Euro
- Dollar
## intent: chitchat/ask_name
- what's your name
- who are you?
- what are you called?
## intent: chitchat/ask_weather
- how's the weather?
- is it sunny where you are?
## intent: faq/other_language
- Can you do another language?
We get rid of responses.md, and the ResponseSelector responses go directly into the domain. The domain file can be split into multiple files. Below is an example of a domain file with only responses defined.
domain_responses.yml
responses:
  chitchat/ask_weather:
  - text: Where do you want to check the weather?
    buttons:
    - title: Current location
      payload: here
    - title: Other place
      payload: other_location
  faq/other_language:
  - text: I can only do English right now.
    image: "https://i.imgur.com/nGF1K8f.jpg"
  - text: I only speak English!
Option #2: YAML Format
As the title says, we go 100% YAML.
With this format you are free to distribute Domain, NLU and Core data among any number of YAML files. The Rasa parser will only read top-level keys (e.g. stories, nlu, slots) to understand which information it’s parsing.
We keep those configuration files that are already using YAML.
We also keep the “generalized” entities syntax from Option #1.
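As a rough illustration of the "top-level keys" idea, the sketch below combines several already-parsed YAML documents (plain Python dicts, as any YAML library would return them) into one project. The function name `merge_documents` and the set `LIST_SECTIONS` are assumptions made for this example; the actual parser may treat sections differently.

```python
# Assumption: these sections are lists of items; every other section is a mapping.
LIST_SECTIONS = {"stories", "nlu", "e2e_tests"}


def merge_documents(documents):
    """Combine parsed YAML documents section by section, keyed by top-level key."""
    merged = {}
    for document in documents:
        for key, value in document.items():
            if key in LIST_SECTIONS:
                # List sections from different files are concatenated.
                merged.setdefault(key, []).extend(value)
            else:
                # Mapping sections (e.g. responses, slots) are merged by name.
                merged.setdefault(key, {}).update(value)
    return merged


# Each dict stands for the parsed content of one YAML file.
documents = [
    {"stories": [{"story": "story_with_or", "steps": []}]},
    {"nlu": [{"intent": "greet", "examples": ["hey", "hello"]}]},
    {"responses": {"faq/other_language": [{"text": "I only speak English!"}]}},
    {"nlu": [{"intent": "deny", "examples": ["no thanks"]}]},
]
project = merge_documents(documents)
print(sorted(project))
print(len(project["nlu"]))
```

With this approach the file names carry no meaning, so a project can be split by topic, by team, or not at all.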
stories.yml
stories:
- story: SCENARIO CHECK
  metadata: This is metadata!
  steps:
  - user: /greet
  - action: utter_SCENARIOCHECK
- story: give_travel_plan CHECK
  steps:
  - user: /greet
    entities: # This means that following entities were extracted from the utterance
    - context_scenario
    - holiday_name
  - slot: # This means that the slot has been set by this point in the conversation
      context_scenario: holiday
      holiday_name: thanksgiving
  - action: action_disclaimer
  - action: utter_holiday-travel_offer_help
- story: give_travel_plan CHECK
  steps:
  - user: /greet # User message
    entities:
    - context_scenario
    - holiday_name
  - slot:
      context_scenario: holiday
      holiday_name: thanksgiving
  - action: action_disclaimer # Run actions
  - action: utter_holiday-travel_offer_help
- story: story_with_a_checkpoint_1
  steps:
  - user: /greet
  - action: utter_greet
  - checkpoint: greet_checkpoint
- story: story_with_a_checkpoint_2
  steps:
  - checkpoint: greet_checkpoint
  - user: /book_flight
  - action: action_book_flight
- story: story_with_or
  steps:
  - user: /book_flight OR /book_train
  - action: action_ask_details
# End to end testing format:
e2e_tests:
- story: A basic end-to-end test
  steps:
  - user: /greet hello # Same as regular user message, with text added to the end
  - action: utter_ask_howcanhelp
  - user: /inform show me [chinese](cuisine) restaurants
  - action: utter_ask_location
  - user: /inform in [Paris](location)
  - action: utter_ask_price
nlu.yml
nlu:
- intent: estimate_emissions
  metadata:
    author: Some example metadata on intent level!
  examples:
  - "how much CO2 will that use?"
  - 'how much carbon will a one way flight from [new york]{"entity": "city", "role": "from"} to california produce?' # Need quotes to include ':' in a string
- intent: buy_offsets
  examples:
  - text: 'I want to buy offsets'
    metadata: Example metadata
  - text: 'I want buy offset'
- intent: inform
  examples:
  - '[NEW YORK]{"entity": "city"}'
  - 'I said I''m going to [Boston]{"entity": "city", "role": "to"}' # Single quotes are escaped as ''
- intent: deny
  examples:
  - 'no' # Some values will need to be quoted or else they'll be parsed as booleans. We will issue warnings to help users avoid this.
  - 'false'
  - 'no thanks'
- synonym: savings
  examples:
  - 'pink pig'
  - 'savings account'
- regex: zipcode
  examples:
  - '[0-9]{5}'
- lookup: additional_currencies # Lookup tables are inline; their elements are included in the same file
  examples:
  - 'Peso'
  - 'Euro'
  - 'Dollar' # If a lookup table gets too large, it can be moved to a separate file (like any other component)
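The quoting caveat for values like no and false can be checked mechanically. Below is a small sketch, not the planned Rasa warning itself, of how unquoted examples could be detected after parsing: a YAML 1.1 parser turns an unquoted no or false into a boolean, so any example that is not a string is suspicious. The function name `non_string_examples` is hypothetical, and the input is the already-parsed nlu section as plain Python data.

```python
LABEL_KEYS = ("intent", "synonym", "regex", "lookup")


def non_string_examples(nlu_section):
    """Yield (label, value) for every example that was not parsed as text."""
    for item in nlu_section:
        # Use whichever key names this item (intent, synonym, regex or lookup).
        label = next((item[key] for key in LABEL_KEYS if key in item), "unknown")
        for example in item.get("examples", []):
            if isinstance(example, dict):
                # Examples with metadata keep their text under the "text" key.
                example = example.get("text")
            if not isinstance(example, str):
                yield label, example


# What a YAML 1.1 parser would produce if 'no' and 'false' were left unquoted.
parsed = [
    {"intent": "deny", "examples": [False, False, "no thanks"]},
    {"intent": "buy_offsets", "examples": [{"text": "I want to buy offsets"}]},
]
problems = list(non_string_examples(parsed))
for label, value in problems:
    print(f"Example {value!r} under '{label}' is not a string; did you forget quotes?")
```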
We get rid of responses.md, and the ResponseSelector responses go directly into the domain. The domain file can be split into multiple files. Below is an example of a domain file with only responses defined.
domain_responses.yml
responses:
  chitchat/ask_weather:
  - text: Where do you want to check the weather?
    buttons:
    - title: Current location
      payload: here
    - title: Other place
      payload: other_location
  faq/other_language:
  - text: I can only do English right now.
    image: "https://i.imgur.com/nGF1K8f.jpg"
  - text: I only speak English!
We would like your feedback on the two formats above: which one would you prefer, and why?
Let’s keep the discussion going in the comments below this post!