New Training Data Format Ideas

fede · June 3, 2020, 8:06am

This could also help if more features are added to the training format, since we would not need too many special symbols for them (or so I think).

Yep! That’s one of the advantages of the YAML format.

But (in case you adopt full YAML), would it be backward compatible?

Ideally we would have some sort of mechanism that allows making the transition smoother. Either supporting both new and old formats at the same time, or creating a tool that migrates data to the new format.
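To give a rough idea of what such a migration tool could look like, here is a minimal sketch that reads the existing Markdown NLU sections (`## intent:<name>` followed by `- example` lines) and turns them into plain dictionaries that could then be dumped as YAML. The function name `migrate_markdown_nlu` and the output layout are assumptions made for illustration, not the format we have settled on.

```python
import re

# Matches Markdown section headers such as "## intent:greet".
HEADER = re.compile(r"^##\s*(?P<kind>\w+):(?P<name>.+)$")


def migrate_markdown_nlu(text):
    """Convert Markdown NLU sections into a list of plain dicts.

    The output layout ({"intent": ..., "examples": [...]}) is hypothetical.
    """
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # Start a new section keyed by its kind (e.g. "intent").
            current = {match.group("kind"): match.group("name").strip(), "examples": []}
            sections.append(current)
        elif line.startswith("- ") and current is not None:
            # Bullet lines belong to the most recent section header.
            current["examples"].append(line[2:].strip())
    return sections


OLD_FORMAT = """
## intent:greet
- hey
- hello there

## intent:goodbye
- bye
"""

migrated = migrate_markdown_nlu(OLD_FORMAT)
for section in migrated:
    print(section)
```

A real tool would also need to handle entity annotations, synonyms, regexes and stories, which this sketch ignores.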

@niveK

As much as I love Markdown, having to write an assortment of regexes for finding intent/entity combinations, story types etc. is definitely a pain. If we had unified, structured training data we could also glean more information for a given story that we can then aggregate to sample from. Things like finding all stories with only a single turn, or stories that have repeating patterns, are all useful to be able to check once your training data becomes large enough.

That’s an interesting point! Just to be sure I understood correctly, are you saying that by having a YAML format you would be able to use standard YAML tooling to parse/process your data easily?
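If that is the idea, the checks mentioned above could be sketched roughly as follows. The list below stands in for what a YAML loader (for example PyYAML's `yaml.safe_load`) might return for a stories file. The `story`/`steps`/`intent`/`action` keys and the helper names are assumptions for illustration only, and "repeating patterns" is simplified here to "the same intent appears more than once".

```python
# Stand-in for parsed YAML stories; the schema is hypothetical.
stories = [
    {"story": "greet only", "steps": [{"intent": "greet"}, {"action": "utter_greet"}]},
    {"story": "greet and ask", "steps": [
        {"intent": "greet"}, {"action": "utter_greet"},
        {"intent": "ask_weather"}, {"action": "utter_weather"},
    ]},
    {"story": "repeat ask", "steps": [
        {"intent": "ask_weather"}, {"action": "utter_weather"},
        {"intent": "ask_weather"}, {"action": "utter_weather"},
    ]},
]


def user_turns(story):
    """Return the intents of a story, one per user turn, in order."""
    return [step["intent"] for step in story["steps"] if "intent" in step]


def single_turn_stories(stories):
    """Names of stories that contain exactly one user turn."""
    return [s["story"] for s in stories if len(user_turns(s)) == 1]


def stories_with_repeated_intent(stories):
    """Names of stories in which at least one intent occurs more than once."""
    names = []
    for s in stories:
        turns = user_turns(s)
        if len(turns) != len(set(turns)):
            names.append(s["story"])
    return names


single = single_turn_stories(stories)
repeated = stories_with_repeated_intent(stories)
print(single)
print(repeated)
```

No regexes are needed once the data is already a list of dictionaries, which I take to be the point being made.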

Also a quick note on lookup table format, degiz – it’s supposed to be a .txt with terms delimited by newlines, so there’s yet another format to manage. A YAML list would be great to use for those!

For both formats (original extended and YAML) we decided to change lookup tables so that their elements are included directly in the training data file. In the case of YAML, it’s just a list of strings (see bottom of the nlu.yml example).
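For anyone with existing newline-delimited `.txt` lookup tables, moving the terms into a list of strings could be as simple as the sketch below. The `lookup` and `elements` keys and both function names are placeholders I made up for the example, so check the nlu.yml example for the actual layout.

```python
def lookup_terms(text):
    """Split a newline-delimited lookup table into a list of terms.

    Blank lines and surrounding whitespace are dropped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def lookup_entry(name, text):
    """Build a dict holding the lookup table name and its terms.

    The key names here are hypothetical placeholders.
    """
    return {"lookup": name, "elements": lookup_terms(text)}


# Contents of an old-style .txt lookup table, one term per line.
OLD_LOOKUP_TXT = "espresso\nflat white\n\ncappuccino\n"

entry = lookup_entry("coffee", OLD_LOOKUP_TXT)
print(entry)
```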
