Problem
As developers start building their assistant, one of the most critical tasks is adding training data for every intent they plan to cover. To make sure the assistant can handle real-world conversations, they need to anticipate the different ways users might express the same intent. The more varied the training data, the more robust the NLU models should become to incoming user messages. We think an interactive data augmentation tool that helps developers come up with these variations would make it much easier to augment their training data with varied paraphrases.
Experimental Solution
We built a paraphrasing model which takes a sentence as input and generates multiple paraphrased versions of it.
For example:
Input message: I want to book an appointment for tomorrow.
Generated Paraphrases
---------------------
- I want to book an appointment for tomorrow.
- i need to book an appointment for another meeting.
- i want to make an appointment for tomorrow.
- i want to book a spot tomorrow.
- i want to book a reservation for tomorrow.
- we need to book an appointment for tomorrow.
- i want to book an appointment for tomorrow at 11 : 00.
- i 'd like to meet tomorrow.
- i 'd like you to arrange a session for tomorrow.
- i 'll book a reservation for tomorrow.
- i 'd like to book the appointment for tomorrow.
- i 'd like to make an appointment for tomorrow.
As you can see, several of these paraphrases are grammatically and semantically meaningful, so we can build an interactive tool in Rasa Open Source or Rasa X where developers pick and choose the variations they like and add them to their existing training data.
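The model itself is distributed through the Colab notebook mentioned in the next section. As a rough illustration of the general approach only (not the exact model or API used in the notebook), here is a minimal sketch that generates paraphrase candidates with a publicly available seq2seq paraphraser from the Hugging Face hub; the model name and generation settings are assumptions.

```python
# Illustrative sketch only: the blog's own experimental model lives in the
# Colab notebook. This shows the general pattern with an assumed public
# paraphrase model from the Hugging Face hub.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "tuner007/pegasus_paraphrase"  # assumed paraphrase model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase(message, num_candidates=10):
    """Return several paraphrase candidates for a single input message."""
    inputs = tokenizer([message], truncation=True, padding=True, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        max_length=60,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(paraphrase("I want to book an appointment for tomorrow."))
```

In practice you would still review the candidates by hand, which is exactly what the interactive tool described above lets you do.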
How to use it?
Since this is an experimental model for now, we haven't tightly integrated it as a feature inside Rasa Open Source. If you would like to play with it, a demo is available as a Google Colab notebook.
- Input a sentence for which you would like to generate paraphrases.
- Once the model generates the paraphrases, select the ones you like and group them under an intent.
- As output, you get all the selected paraphrases grouped under the selected intent, formatted in the Markdown training data format (see the example after this list). You can copy the formatted output back into your training data file.
- Re-run model training and evaluation to see whether the augmented data helps your NLU tasks.
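For reference, the Markdown training data format the tool emits looks like the snippet below; the intent name and examples here are just illustrative.

```
## intent:book_appointment
- I want to book an appointment for tomorrow.
- i want to make an appointment for tomorrow.
- i 'd like to make an appointment for tomorrow.
```

After pasting the new examples into your NLU training data file, a typical loop is `rasa train nlu` followed by `rasa test nlu` to re-train and re-evaluate with the augmented data.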
Additionally, you can control the generation by entering stop words that you would like the model to leave out of its paraphrases. For example, if your input message is "I would like to book an appointment for Tuesday", you may want a few variations of the same sentence that do not contain the word "appointment". You can provide multiple such stop words. We found this to be a neat trick for generating more variation in the paraphrases.
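To make the idea concrete, here is a minimal sketch of the stop-word behaviour. This is only a rough post-hoc approximation written for illustration; the notebook itself steers generation away from the stop words rather than filtering finished candidates.

```python
def filter_by_stop_words(candidates, stop_words):
    """Drop paraphrase candidates that contain any of the unwanted words."""
    stop = {w.lower() for w in stop_words}
    return [c for c in candidates if not stop & set(c.lower().split())]

# A few candidates taken from the example above.
candidates = [
    "i want to make an appointment for tomorrow.",
    "i 'd like to meet tomorrow.",
    "i want to book a spot tomorrow.",
]
print(filter_by_stop_words(candidates, ["appointment"]))
# keeps only the candidates that do not contain "appointment"
```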
Note: The current model can handle input messages in English only, as it was trained on an English corpus.
How to contribute?
We would love to hear your feedback on whether augmenting your NLU data with paraphrases from this model helps improve the performance of your NLU models. Specifically, here are some interesting scenarios to try:
- Pick one intent that has a low number of training examples (say, 10). Generate paraphrases for different kinds of messages under that intent and add them back to the training data. Compare the intent accuracy for that particular intent with and without the augmented data (see the sketch after this list).
- Compare the overall intent accuracy with and without the augmented data in the above scenario.
- Try generating paraphrases for different kinds of sentences: short and long sentences, sentences with multiple intents in them, chitchat sentences, etc.
- Compare the performance of the response selector before and after augmenting the training data for some FAQ-style user messages, or for any other training data you have for it.
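If you run `rasa test nlu` once on the original data and once on the augmented data, each run writes a per-intent report you can compare. The sketch below assumes each run's report was saved as intent_report.json under a separate results directory; the paths are assumptions, so adjust them to your setup.

```python
# Compare per-intent F1 between a baseline run and an augmented run of
# `rasa test nlu`. The report paths below are assumptions.
import json

def load_report(path):
    """Load the per-intent report written by `rasa test nlu`."""
    with open(path) as f:
        return json.load(f)

baseline = load_report("results_baseline/intent_report.json")
augmented = load_report("results_augmented/intent_report.json")

for intent, scores in baseline.items():
    # Skip aggregate entries such as "accuracy", which are not per-intent dicts.
    if not isinstance(scores, dict) or "f1-score" not in scores:
        continue
    after = augmented.get(intent, {}).get("f1-score")
    if after is not None:
        print(f"{intent}: f1 {scores['f1-score']:.3f} -> {after:.3f}")
```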