Donate your NLU training data!

Emma · November 13, 2019, 12:30pm

We have created a new repository that lives in RasaHQ/NLU-training-data with the goal of providing basic training data for developing chatbots.

We are currently testing this initiative, and we will need your help to build this open source dataset - which means it’s now open for contributions!

How do I donate my training data?
The README in the GitHub repository includes a guide on how to donate your data. The repository is divided into categories of intent, and there is also an FAQ section to help you understand where to put your training data.

What about training data that’s not in English?
Right now, we are unable to evaluate the quality of all language contributions, and therefore, during the initial phase we can only accept English training data to the repository.
However, we understand that the Rasa community is a global one, and in the long term we would like to find a solution for this in collaboration with the community.

Your feedback
We created this based on suggestions from the Rasa community, and we’d love to improve it in a direction that benefits you and other developers. It would therefore be great to have your thoughts on the following:

Do you think that the organisation of the repository works well and is intuitive?
Do you feel this would be a valuable resource for the community?

davi · November 13, 2019, 1:41pm

Hello @Emma, it’s just what I’m looking for.

I’ll contribute some data, but how about other languages? Maybe change the file names or the folder structure.

Thanks, great initiative!

-Best

markusgl · November 13, 2019, 2:05pm

As mentioned in the README, only English is accepted at the moment.

Emma · November 13, 2019, 2:29pm

Hey @davi,

That’s awesome to hear!

Exactly, as @markusgl kindly mentioned, first we would like to test it out in English so that we can evaluate the quality. If we are able to open this up to localised training data in future, we would adjust the repo structure retroactively to specify the language and make this much clearer.

That said, it’s great to know that you’re interested in donating localised training data, and letting us know really helps us understand what the community is looking for.

davi · November 13, 2019, 2:31pm

You are right, sorry. Anyway, I just forked the repo; when it is ready for other languages, I’ll make a PR.

Thanks!

jonathanpwheat · November 13, 2019, 3:21pm

Just added a pull request with a BUNCH of new intents (54) for smalltalk and a handful of new intents (5) for mood, plus some additional data for some of the out-of-the-box intents in both of those categories.

Looking forward to seeing what others will contribute!

abhishakskilrock · November 13, 2019, 4:33pm

Hey @Emma

I have added 86 different intents for small talk. Please review them, and if you find them useful, do let me know.

jonathanpwheat · November 13, 2019, 4:40pm

Wow @abhishakskilrock - those are fantastic, well done

abhishakskilrock · November 13, 2019, 4:43pm

Hey @jonathanpwheat

Thanks for the compliment. By the way, you also did a fantastic job by providing 54 different intents.

jonathanpwheat · November 13, 2019, 4:47pm

Thanks. I see some overlap of intents, but you have all the context entities set up, whereas I just have basic data.

I’m glad this is an open source shared repo, because I’ll be implementing your NLU data in the small talk portion of my bot.

abhishakskilrock · November 13, 2019, 4:53pm

Sure @jonathanpwheat. After all, this is the real purpose of open source: everyone can benefit from others’ contributions.

Emma · November 14, 2019, 11:10am

@jonathanpwheat & @abhishakskilrock,

Wow guys! Thank you so much for submitting all of this training data! We should be able to review your PRs before the end of this week.

It’s also great to see such a wholesome discussion going on here. We are very fortunate to have this supportive community.

azharameen · November 15, 2019, 5:25am

Hi, I have added my employment bot NLU data. This is my first contribution, so please let me know if I have made any mistakes.

Thanks

Emma · February 17, 2021, 2:55pm

Hey everyone,

I want to share a couple of updates to this repo:

1. Grab domain specific data with the Intent Example Finder

Research Advocate Vincent @koaning developed a tool, the Intent Example Finder, that provides an interface for easily collecting domain specific training data. Use the selector in the sidebar to construct NLU data as a starting point, and use the clipboard icon to quickly copy the data!


2. Domain files are now YAML, and Rasa 2.x ready

You can also collect this data in YAML, instead of the previous Markdown format. You can find more information on the deprecation of Markdown, and the commands to convert Markdown to YAML, in our docs.
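To make the difference between the two formats concrete, here is a minimal Python sketch of what such a conversion does. It is an illustration only, not the converter that ships with Rasa: the function name `markdown_nlu_to_yaml` is hypothetical, it assumes the old Markdown layout of `## intent:<name>` headers followed by `- example` bullets, and it skips every other section type (synonyms, regexes, lookup tables). For real projects, the conversion commands in the docs remain the supported route.

```python
def markdown_nlu_to_yaml(markdown_text: str) -> str:
    """Convert '## intent:<name>' sections with '- example' bullets
    into Rasa 2.x style YAML. Hypothetical helper, for illustration only."""
    intents = {}    # intent name -> list of examples, in order of appearance
    current = None  # intent currently being read, or None outside an intent

    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if line.startswith("## intent:"):
            current = line[len("## intent:"):].strip()
            intents.setdefault(current, [])
        elif line.startswith("##"):
            # Any other section (synonym, regex, lookup) is skipped in this sketch.
            current = None
        elif line.startswith("- ") and current is not None:
            intents[current].append(line[2:].strip())

    out = ['version: "2.0"', "nlu:"]
    for name, examples in intents.items():
        out.append(f"- intent: {name}")
        out.append("  examples: |")
        out.extend(f"    - {example}" for example in examples)
    return "\n".join(out) + "\n"


sample = """## intent:greet
- hello
- hi there

## intent:goodbye
- bye
"""
print(markdown_nlu_to_yaml(sample))
```

Entity annotations such as `[Berlin](city)` are passed through unchanged, since the sketch treats each example as plain text.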

Today, we have 1196 examples on this repo! Thank you to everyone supporting this crowdsourcing project and donating data!

mohammed · February 24, 2021, 6:09pm

Hi, does it support only English, or multiple languages?

shazadmaved · February 24, 2021, 6:53pm

Hi Mohammed, Rasa supports multiple languages as well, not just English.

mohammed · March 3, 2021, 11:16am

I mean the data hub: are there intents in different languages, or just English, @shazadmaved?

staticdev · March 17, 2021, 9:43am

@Emma I really think Rasa should invest more in multi-language support, and this repo is an example of that. Of the 10 countries with the most internet users, only one is natively English-speaking. There is a huge market being neglected here.