New Training Data Format Ideas

@saurabh-m523

  • This could also help if more features are added to the training format, since we would then not need so many special symbols for them. (or so I think :sweat_smile:)

Yep! That’s one of the advantages of the YAML format.

But (in case you guys adopt full YAML) would it be backward compatible?

Ideally we would have some sort of mechanism that makes the transition smoother: either supporting both the new and old formats at the same time, or creating a tool that migrates data to the new format.

@niveK

As much as I love Markdown, having to write an assortment of regexes for finding intent/entity combinations, story types etc. is definitely a pain. If we had unified, structured training data we could also glean more information for a given story that we can then aggregate to sample from. Things like finding all stories with only a single turn, or stories that have repeating patterns, are all useful to be able to check once your training data becomes large enough.

That’s an interesting point! Just to be sure I understood correctly, are you saying that by having a YAML format you would be able to use standard YAML tooling to parse/process your data easily?
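
To make the idea concrete, here is a rough sketch of the kind of check described above (finding single-turn stories) once the data is already parsed. The story structure and the keys `story`, `steps`, `intent` and `action` are assumptions for illustration only, not the final format; the list is written inline to stand in for what a YAML parser (for example PyYAML's `yaml.safe_load`) might return.

```python
# Hypothetical parsed stories; in practice this would come from a YAML parser.
# The key names ("story", "steps", "intent", "action") are assumed, not final.
stories = [
    {
        "story": "greet only",
        "steps": [{"intent": "greet"}, {"action": "utter_greet"}],
    },
    {
        "story": "greet and bye",
        "steps": [
            {"intent": "greet"},
            {"action": "utter_greet"},
            {"intent": "goodbye"},
            {"action": "utter_goodbye"},
        ],
    },
]


def user_turns(story):
    """Count the steps that carry a user intent (one per user turn)."""
    return sum(1 for step in story["steps"] if "intent" in step)


def single_turn_stories(all_stories):
    """Return the names of stories that contain exactly one user turn."""
    return [s["story"] for s in all_stories if user_turns(s) == 1]


print(single_turn_stories(stories))
```

No regexes are involved: once the data is structured, a check like this is a plain loop over dictionaries, and other aggregates (repeating patterns, intent/entity combinations) could be written the same way.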

Also a quick note on lookup table format, degiz – it’s supposed to be a .txt with terms delimited by newlines, so there’s yet another format to manage. A YAML list would be great to use for those!

For both formats (original extended and YAML) we decided to change lookup tables so that their elements are included directly in the training data file. In the case of YAML, it’s just a list of strings (see bottom of the nlu.yml example).
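
As a small illustration of how existing lookup tables could be carried over, the sketch below turns the contents of a newline-delimited .txt lookup file into a list of strings. The function name `lookup_from_text` and the keys `lookup` and `examples` are hypothetical, chosen only for this example; the resulting dictionary could then be written out with whatever YAML library you use.

```python
def lookup_from_text(name, text):
    """Convert newline-delimited lookup terms into a list of strings.

    Blank lines are skipped and surrounding whitespace is stripped.
    The returned keys ("lookup", "examples") are assumed for illustration.
    """
    terms = [line.strip() for line in text.splitlines() if line.strip()]
    return {"lookup": name, "examples": terms}


# Stand-in for the contents of an old-style lookup .txt file.
old_file_contents = "Berlin\n\n  Paris \nNew York\n"
print(lookup_from_text("city", old_file_contents))
```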
