Paraphrasing for NLU Data Augmentation [Experimental]

Problem

As developers start building their assistant, one of the most critical tasks is to add training data for all the intents they plan to cover. To ensure the assistant can handle real-world conversations, they need to anticipate the different ways their users might express the same intent. The hope is that the more varied the training data, the more robust the NLU models become to incoming user messages. We think it would be helpful to have an interactive data augmentation tool that helps a developer come up with these variations, in order to augment their training data with varied paraphrases.

Experimental Solution

We built a paraphrasing model which takes a sentence as input and generates multiple paraphrased versions of it.

For example:

input message: I want to book an appointment for tomorrow.
 
Generated Paraphrases
---------------------
 
- I want to book an appointment for tomorrow.
- i need to book an appointment for another meeting.
- i want to make an appointment for tomorrow.
- i want to book a spot tomorrow.
- i want to book a reservation for tomorrow.
- we need to book an appointment for tomorrow.
- i want to book an appointment for tomorrow at 11 : 00.
- i 'd like to meet tomorrow.
- i 'd like you to arrange a session for tomorrow.
- i 'll book a reservation for tomorrow.
- i 'd like to book the appointment for tomorrow.
- i 'd like to make an appointment for tomorrow.

As you can see, several of them are grammatically and semantically meaningful, so we can build an interactive tool in Rasa Open Source or Rasa X where developers pick and choose the variations they like and then add them to their existing training data.

How to use it?

Since this is an experimental model as of now, we haven’t tightly integrated it as a feature inside Rasa Open Source. If you would like to play with it, a demo is available as a Google Colab notebook.

  1. Input a sentence for which you would like to generate paraphrases.
  2. Once the model generates the paraphrases, select the ones you like and group them under an intent.
  3. As output, you get all the selected paraphrases grouped under the selected intent, formatted in the Markdown training data format (see the sketch below). You can copy the formatted output back into your training data file.
  4. Re-run model training and evaluation to see whether the augmented data helps on your NLU tasks.
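
To make step 3 concrete, here is a minimal sketch of that formatting step; the `to_markdown` helper and the intent name are made up, and the examples are taken from the paraphrase list above:

    # Hypothetical helper that groups selected paraphrases under an intent
    # in the Markdown training data format.
    def to_markdown(intent, examples):
        lines = [f"## intent:{intent}"]
        lines += [f"- {example}" for example in examples]
        return "\n".join(lines)

    selected = [
        "i want to make an appointment for tomorrow",
        "i 'd like to book the appointment for tomorrow",
    ]
    print(to_markdown("book_appointment", selected))
    # ## intent:book_appointment
    # - i want to make an appointment for tomorrow
    # - i 'd like to book the appointment for tomorrow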

Additionally, you can control the generation of paraphrases by entering stop words that you would like the model to avoid in its output. For example, if the input message is “I would like to book an appointment for Tuesday”, you might want a few variations of the same sentence but without the word “appointment” in them. You can provide multiple such stop words. We found this to be a neat trick for getting more variety in the paraphrases.
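
For the curious: one way such stop-word constraints can be implemented is the `bad_words_ids` argument of `generate()` in HuggingFace transformers. Whether the demo notebook uses this exact mechanism is an assumption on my part; a minimal sketch, with `gpt2` standing in for the fine-tuned checkpoint:

    # Sketch: banning stop words during generation with bad_words_ids.
    # "gpt2" is a placeholder; the real tool would load its fine-tuned weights.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # GPT-2's BPE encodes "appointment" and " appointment" differently,
    # so ban both variants.
    stop_words = ["appointment", " appointment"]
    bad_words_ids = [tokenizer(w, add_special_tokens=False).input_ids
                     for w in stop_words]

    inputs = tokenizer("I would like to book an appointment for Tuesday.",
                       return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                # sample for variety
        max_length=40,
        num_return_sequences=3,
        bad_words_ids=bad_words_ids,   # these token sequences are never generated
        pad_token_id=tokenizer.eos_token_id,
    )
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))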

Note: The current model can handle input messages in English only, as it is trained on an English corpus.

How to contribute?

We would love to hear your feedback on whether augmenting your NLU data with paraphrases from this model helps improve the performance of your NLU models. Specifically, here are some interesting scenarios to try:

  1. Pick one intent which has a low number of training examples (say, 10). Generate paraphrases for different kinds of messages under that intent and add them back to the training data. Compare the intent accuracy for that particular intent with and without the augmented data.
  2. Compare the overall intent accuracy with and without the augmented data in the above scenario (see the example commands after this list).
  3. Try generating paraphrases for different kinds of sentences - short and long sentences, sentences with multiple intents in them, chitchat sentences, etc.
  4. Compare the performance of the response selector before and after augmenting the training data for some FAQ-style user messages or any training data that you have for it.
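
For the accuracy comparisons in scenarios 1 and 2, one possible workflow is Rasa's cross-validation mode (flag names as of Rasa 1.x, file paths are examples; check `rasa test nlu --help` for your version):

    # Evaluate intent classification with 5-fold cross-validation,
    # once on the original data and once on the augmented data.
    rasa test nlu --nlu data/nlu.md --cross-validation --folds 5
    rasa test nlu --nlu data/nlu_augmented.md --cross-validation --folds 5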

Very interesting! I’ve always found data augmentation in NLP quite tricky, especially with NER. Are there details of your model/training data anywhere, or is that private?


Does it work for languages other than English?


Hi @dakshvar22,

awesome feature.

I finished my first round of experiments and want to hand over the collected feedback. I divided my experiments into two parts.

The first part was done by myself, and the second one, with an identical test scenario, was done by a colleague of mine who is completely inexperienced in AI – I was hoping for feedback from a neutral side.

We tested:

  1. Short sentences with only a few meaningful words and one exact intent
  2. Longer sentences with only a few meaningful words and one exact intent
  3. Short and longer sentences with lots of meaningful words and one exact intent
  4. The same procedure for two or more intents
  5. We took the sentences from 1 and used the meaningful words as stop words, to check whether we could extract sentence “patterns” for intents
  6. Different settings for the number of sentences to produce
  7. Chitchat sentences

For the sake of simplicity, I consolidate the results for both of us here. The figures in brackets indicate how many of the generated examples were usable. The values are a bit debatable, since my colleague and I didn’t come to a consensus on the semantics of every sample :smiley: we then chose to be very strict, so that the results can be properly interpreted.

Results:

  1. Example sentences, e.g.:
  • I want to book a table at a restaurant (8/15)
  • I want to order a pizza (12/15)
  • I want to know where the next Park is (1/15)
  2. Example sentences, e.g.:
  • I want to make a visit at the chamber of secrets with all of my friends. (3/15)
  • How can I withdraw money from my account can you help me please? (4/15)
  • I need to know your email address please give it to me (12/15)
  3. Example sentences, e.g.:
  • I need 50 dollars from my primary bank account in the state of Miami. (3-4/15)
  • I want to order three pizzas, two with chicken and onion and one with salami, oregano and tuna. (5/15)
  • Can you please book a flight from Chicago to Washington at 2 am Sunday. (7/15)
  • I would like to check the status of my Amazon order from 15.01.2018 (11/15)
  • I need urgent help from an electrician because the lamp in my living room no longer works. (6/15)
  4. Example sentences, e.g.:
  • I want to order a pizza for two and I want to book a flight from Miami to Chicago. (3/15)
  • I want to make an appointment and need help with my bank account (6/15)
  • I want to order two pizzas and a laptop from amazon (10/15)
  • Can you tell me how to book a flight and search my keys for me? (4/15)
  5. Example sentences, e.g.:
  • I want to order a [pizza] (5/15) – memorable: “No I have to order a fucking pie” :smiley:
  • I want to book a [table] at a [restaurant] (6/15)
  6. Example sentence, e.g.:
  • I would like to check the status of my bank account. [amount: 50] (~25/50)
  7. Example sentences, e.g.:
  • Hand me over the canvas please (3/15)
  • How old are you (4/15)
  • What is your name (2/15)

Peculiarities:

Overall the system performs quite well. My goal was to enhance existing datasets with samples that hadn’t come to my mind yet. This worked out very well, depending on the intent. At first I thought that the longer the sentences became, the more the quality would decrease – this assumption was wrong. However, the number of meaningful words does seem to correlate with the quality of the outcome. That’s expected, I guess, since the system performs paraphrasing on the context level rather than on the intent level (at least that’s what I assume). I also observed that the system (also expected) works pretty well the closer the intents are to the domains of the pretraining dataset.

After examining what happens to the order of words in a sentence, I started to think that the German task would be a bit more complex, since German grammar is a bit more delicate than English grammar. However, the samples that I chose to add to my intents were really good if I picked them for overall quality – meaning that the model did not simply shuffle the words into a different order.

My assumption that the system wouldn’t perform well on multi-intent sentences was wrong, but once I started to think about the reason, it quickly became clear: if it works on short context ranges, why shouldn’t it perform well on multiple intents? However, we need to keep in mind that most of my samples simply concatenated the intents with conjunctions. Excluding the meaningful words to extract patterns went completely haywire – that was expected, since if my assumption of a context-based system is right, I am deliberately removing the relevant words in this case. What’s left can be so generic that the system has to try “something”. However, I observed that the stop words could not always be fully excluded, since the output sometimes still includes them.

This is either because of some estimation that actually led to correct predictions, or it was simply coincidence. The “more is more” approach actually worked out, since I ended up with far more useful examples than expected.

Evaluations

Pipelines

Config_alt:

    pipeline:
     - name: HFTransformersNLP
       model_name: "bert"
       model_weights: "bert-base-cased"
     - name: "LanguageModelTokenizer"
       # Flag to check whether to split intents
       "intent_tokenization_flag": False
       # Symbol on which intent should be split
       "intent_split_symbol": "+"
     - name: "LanguageModelFeaturizer"
     - name: LexicalSyntacticFeaturizer
     - name: CountVectorsFeaturizer
       strip_accents: "unicode"
     - name: CountVectorsFeaturizer
       strip_accents: "unicode"
       analyzer: "char_wb"
       min_ngram: 2
       max_ngram: 15
     - name: DIETClassifier
       intent_classification: True
       entity_recognition: False
       use_masked_language_model: False
       number_of_transformer_layers: 2

Config:

    pipeline:
     - name: SpacyNLP
       model: en_trf_bertbaseuncased_lg_tt
     - name: SpacyTokenizer
       intent_tokenization_flag: true
       intent_split_symbol: "+"
     - name: CountVectorsFeaturizer
       strip_accents: "unicode"
     - name: CountVectorsFeaturizer
       strip_accents: "unicode"
       analyzer: "char_wb"
       min_ngram: 2
       max_ngram: 15
     - name: SpacyFeaturizer
     - name: DIETClassifier
       intent_classification: True
       entity_recognition: False
       use_masked_language_model: False
       BILOU_flag: False
       number_of_transformer_layers: 0

Evaluation

5-fold cross validation with 3 runs

Dataset 1

15 different intents (semantically widely spread)

10 samples per intent

Dataset 2

45 different intents (semantically close)

50 samples per intent

Dataset 3

55 different intents (both semantically widely spread and close)

75 – 100 samples per intent

Scenario 1

10 out of 10 paraphrased samples were added blindly

Scenario 2

20 samples were paraphrased; on average, 7 of them were carefully chosen and added

Config_alt.yml

| Scenario | Dataset | Non-augmented macro avg F1 | Augmented macro avg F1 |
|----------|---------|----------------------------|------------------------|
| 1        | 1       | 0.79                       | 0.80                   |
| 1        | 2       | 0.84                       | 0.84                   |
| 1        | 3       | 0.87                       | 0.90                   |
| 2        | 1       | 0.79                       | 0.91                   |
| 2        | 2       | 0.84                       | 0.92                   |
| 2        | 3       | 0.87                       | 0.94                   |

Config.yml

| Scenario | Dataset | Non-augmented macro avg F1 | Augmented macro avg F1 |
|----------|---------|----------------------------|------------------------|
| 1        | 1       | 0.85                       | 0.87                   |
| 1        | 2       | 0.91                       | 0.91                   |
| 1        | 3       | 0.96                       | 0.92                   |
| 2        | 1       | 0.85                       | 0.94                   |
| 2        | 2       | 0.91                       | 0.92                   |
| 2        | 3       | 0.96                       | 0.97                   |

Personally I think the results are not surprising – I am satisfied with them; however, the configs have to be kept in mind. The config.yml is one that is currently used in production. I am playing around with the new HFTransformers feature and things like that.

Kind regards
Julian


Hi, the download link in the Colab doesn’t seem to work for me :pensive:

https://docs.google.com/uc?export=download

When I tried to hit the URL in a browser, I got a 400 Bad Request. Can anyone help me with this?

@Zylatis Thanks for the comment. We have fine-tuned a generative model (GPT-2) for this particular task. The input to the model is the sentence to be paraphrased, and the model is expected to generate a paraphrased variant of the same sentence. We used a smaller subset (about 5M) of the complete ParaNMT-50M dataset.
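
For readers wondering what inference with such a setup might look like, here is a rough sketch using HuggingFace transformers; the `>>>` separator, decoding parameters, and the `gpt2` checkpoint name are all assumptions, not details of the actual notebook:

    # Sketch of inference with a GPT-2 model fine-tuned on paraphrase pairs
    # serialized as "<input> >>> <paraphrase>". The separator is assumed.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder checkpoint
    model = GPT2LMHeadModel.from_pretrained("gpt2")     # would be the fine-tuned weights

    prompt = "I want to book an appointment for tomorrow. >>> "
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling yields varied paraphrases
        top_p=0.9,
        max_length=40,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,
    )
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        print(text.split(">>>", 1)[1].strip())  # keep only the generated part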


@neves As the post mentions, since it is trained only on an English corpus, it works only for input sentences in English.

@indranil180 The link that you posted seems to be the wrong one. I rechecked, and it should work if you try downloading the notebook from this link. Thanks

Hah! I was thinking about building something like this. I like the idea but I have some feedback.

Entities

It would be nice if this method were robust against entities. Take this starting point:

how do i catch [pichachu](pokemon)

When I now mention that I want to change “catch”, this is what I get:

## intent:catch
- how do i catch [pichachu](pokemon)
- how to do [ pichachu ]
- how do i get [ pichacho ]
- how do i find [ pichachu ]
- how to get [ pichachu ]
- how to capture [ [ pichachu ] ]
- how do you know [ pichachu ]
- what i'm doing is catching [ pichachu ]
- how do i get [ pichachu ]
- how do i get [ pichachu ]?
- how do i pick [ pichachu ]
- how do i get [ kikakure ]
- how do i get a [ pichachu ]?
- how do i get... pichachu... ( pokemon )
- how do i grab [ pichachu ]
- how to get [ pokemon ]

First: I like some of these changes, and I understand it wasn’t the immediate goal to support this, but it would be nice if it just kept [picachu](pokemon) intact. If the entity is removed, I get this:

## intent:capture
- how do i catch picachu
- and so the story of picac is : why, why, my dear, my dear, my
- so i'm gon na try catching picachu
- what about picachu?
- what to know, i 'll find picachu
- why can i capture picachu?
- then i have to grab picacho.
- and i'm looking for picachu
- what should i be catching by picak
- so, what will you tell me
- what to get picacachu
- why i want picachi
- so i can take picachu
- where to find picachina

Multiple Inputs

If I now choose to replace “how”, “catch” and “do”, I get these results:

## intent:catch
- how do i catch [pichachu](pokemon)
- how to do [ pichachu ]
- how do i get [ pichacho ]
- how do i find [ pichachu ]
- how to get [ pichachu ]
- how to capture [ [ pichachu ] ]
- how do you know [ pichachu ]
- what i'm doing is catching [ pichachu ]
- how do i get [ pichachu ]
- how do i get [ pichachu ]?
- how do i pick [ pichachu ]
- how do i get [ kikakure ]
- how do i get a [ pichachu ]?
- how do i get... pichachu... ( pokemon )
- how do i grab [ pichachu ]
- how to get [ pokemon ]
- and we'il get [ pichachu ]
- so i'm gon na find myself at [ pichachu ] ( p. pokemon )
- what if i caught [ pichachu ]
- what to use [ pichachu ]
- why can i capture [ pichachu )
- what am i supposed to say... what i'm supposed to say... what i didn't
- to grab [ pichachu ]
- what i'm gon na try to bring [ pichachu ]
- what can i get from [ pichachu ( pokash ) }
- why i'm catching [ pichachu ]
- what's the best way to get [ pichachu ]
- where to find [ pichachu ]
- what to get [ pichachu ]?
- where can i find [ pichachu ]
- where can i find a [ pichachu ]?
- what can i get? [ pichachu ( pokemon )
- what to use [ pokemon ]

There are a few duplicates in there, which is a bummer, but I also wonder if it would be possible to give multiple inputs to the system. With only one sentence, there’s a chance that it goes out of scope. If instead it had a list of the intents that I already have, it might be better able to generate some ground that is not already covered.

User Interface

I think I would not use this tool as the only source of data. Instead I would use it to append the candidates that I like. This might be nice as a Streamlit app (wink wink) to play with.

Simple Benchmark

The transformer approach has merit. But for me it would be easier to evaluate it properly if I could compare it against a simple thesaurus lookup.
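
For reference, a very crude version of that baseline could be sketched with NLTK's WordNet; `thesaurus_variants` is a made-up helper that swaps one word at a time and ignores grammar entirely:

    # Naive thesaurus baseline: swap one word at a time for a WordNet synonym.
    # Requires: pip install nltk, then nltk.download("wordnet") once.
    import itertools
    from nltk.corpus import wordnet as wn

    def thesaurus_variants(sentence, per_word=2):
        words = sentence.split()
        variants = []
        for i, word in enumerate(words):
            synonyms = {lemma.name().replace("_", " ")
                        for synset in wn.synsets(word)
                        for lemma in synset.lemmas()} - {word}
            for synonym in itertools.islice(sorted(synonyms), per_word):
                variants.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
        return variants

    print(thesaurus_variants("how do i catch pikachu"))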


@JulianGerhard As far as I can see in your config, you used the German BERT model for intent classification, but the paraphrasing only works for English sentences. So I guess you translated them manually before feeding them into your Rasa training data. Am I right?

Hi @dakshvar22

This will be a boon for low-resource languages. I would like to implement this for Indian languages, starting with Hindi. How can I get started building a model for Hindi?

How can we include entities in the paraphrases?

Hi @lindig,

thanks for pointing that out. I simply messed up the configs. I am juggling English and German on my device and picked the wrong set of configs, but the evaluations correspond to the English configs I just posted.

I did not try translating the sentences as in classical augmentation approaches, but are you interested in an evaluation based on this approach?

Kind regards
Julian

@vishu1994 Right now the model wouldn’t handle entities well. My suggestion for now would be to input messages without entities, collect the paraphrases and re-annotate the entities yourself.


@AvinashBenki I agree this would be really useful for low-resource languages like Hindi. You will need a large dataset of paraphrases similar to this one.

Is it possible to retrain the model with a different dataset to get domain-specific results?

My bot speaks German, and as always in the early stages of developing a chatbot, I struggle to produce a large amount of varied training data. As far as I know, Chatito also only understands English.

I used the intent examples of the Sara demo bot and translated a few intents like ask_builder just with Google Translate. This worked well all in all. A few examples I had to remove due to wrong translation, but this way I got a lot of new training data. It would be very interesting to try this approach here as well.

If I have time in the next weeks, I will definitely try to translate the augmentation results into German and check the results.


@dakshvar22 Thank you for introducing this feature – I think it is an important one. I also implemented it for my chatbot in Vietnamese, and intent accuracy increased significantly. However, I am thinking about how to evaluate the performance of the module. Do you have any ideas for it?

Hi @tn.son, here are some interesting scenarios to try:

  1. Pick one intent which has a low number of training examples (say, 10). Generate paraphrases for different kinds of messages under that intent and add them back to the training data. Compare the intent accuracy for that particular intent with and without the augmented data.
  2. Compare the overall intent accuracy with and without the augmented data in the above scenario.
  3. Try generating paraphrases for different kinds of sentences - short and long sentences, sentences with multiple intents in them, chitchat sentences, etc.

Nice example, and I’m also interested in this specifically for sentences containing entities. If I could wave a magic wand, I’d say we need a way to tell the GPT model to leave the entity token alone; then we can backfill from a list later on and randomly sprinkle entities into that slot ourselves (this assumes independence between the specific entity and the rest of the sentence, rather than just the entity type, but I think that’s reasonable).

If the model were generating sentences of fixed length, i.e. only changing certain words, we could probably get around that with postprocessing, e.g. the input to the model is

how do i catch <POKEMON>

it produces

how do i find grandma

and we replace grandma with <POKEMON> again and then just randomly assign entity examples. Of course, limiting ourselves to same-length sentences is a serious limitation.
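
A quick sketch of that placeholder-and-backfill idea; everything here is hypothetical, and it assumes the placeholder survives generation verbatim, which, as the outputs above show, it often won’t:

    # Hypothetical post-processing: paraphrase around a placeholder token,
    # then backfill it with annotated entity values in Rasa's Markdown syntax.
    import random

    def backfill(paraphrases, placeholder, entity_values, entity_type):
        examples = []
        for text in paraphrases:
            if placeholder not in text:
                continue  # the model mangled the placeholder; discard
            value = random.choice(entity_values)
            examples.append(text.replace(placeholder, f"[{value}]({entity_type})"))
        return examples

    paraphrases = ["how do i find <POKEMON>", "where can i get <POKEMON>"]
    print(backfill(paraphrases, "<POKEMON>", ["pikachu", "charmander"], "pokemon"))
    # e.g. ['how do i find [pikachu](pokemon)', 'where can i get [charmander](pokemon)']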

Is there any way with these models to determine the exact replacement taking place? If so we could use that, but I suspect that’s not possible.

For some context, my application is (sometimes multiple) address extraction for an online customer service chatbot. There are oodles of ways people talk about addresses, and being able to interpolate between them with this model would be great.