Paraphrasing for NLU Data Augmentation [Experimental]

Hi @dakshvar22,

awesome feature.

I finished my first draft of experiments and want to hand over the collected feedback. I divided my experiments into two parts.

The first part was done by myself; the second one, with an identical test scenario, was done by a colleague of mine who is completely inexperienced with AI. I was hoping for feedback from a neutral side.

We tested:

  1. Short sentences with only a few meaningful words and one exact intent
  2. Longer sentences with only a few meaningful words and one exact intent
  3. Short and longer sentences with lots of meaningful words and one exact intent
  4. The same procedure for two and more intents (see the data snippet below this list)
  5. We took the sentences from 1 and used the meaningful words as stop words, to check whether we could extract sentence patterns for intents
  6. Different settings for the number of sentences to generate
  7. Chitchat sentences
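
To make point 4 concrete: the multi-intent samples are plain Rasa NLU Markdown entries whose intent name joins the single intents with the split symbol configured in the pipelines further down ("+"). The intent names here are made up for illustration:

    ## intent:order_pizza+book_flight
    - I want to order a pizza for two and I want to book a flight from Miami to Chicago.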

For the sake of simplicity I'll consolidate the results for both of us now. The figures in brackets indicate how many of the generated samples were usable (usable/generated). The values should be taken with a grain of salt, since my colleague and I didn't reach a consensus on the semantics of every sample :smiley: we therefore chose to judge very strictly, so that the results can be interpreted properly.

Results:

  1. Example sentences:
  • I want to book a table at a restaurant (8/15)
  • I want to order a pizza (12/15)
  • I want to know where the next Park is (1/15)
  2. Example sentences:
  • I want to make a visit at the chamber of secrets with all of my friends. (3/15)
  • How can I withdraw money from my account can you help me please? (4/15)
  • I need to know your email address please give it to me (12/15)
  3. Example sentences:
  • I need 50 dollars from my primary bank account in the state of Miami. (3-4/15)
  • I want to order three pizzas, two with chicken and onion and one with salami, oregano and tuna. (5/15)
  • Can you please book a flight from Chicago to Washington at 2 am Sunday. (7/15)
  • I would like to check the status of my Amazon order from 15.01.2018 (11/15)
  • I need urgent help from an electrician because the lamp in my living room no longer works. (6/15)
  4. Example sentences:
  • I want to order a pizza for two and I want to book a flight from Miami to Chicago. (3/15)
  • I want to make an appointment and need help with my bank account (6/15)
  • I want to order two pizzas and a laptop from amazon (10/15)
  • Can you tell me how to book a flight and search my keys for me? (4/15)
  5. Example sentences (bracketed words were used as stop words):
  • I want to order a [pizza] (5/15) – memorable: “No I have to order a fucking pie” :smiley:
  • I want to book a [table] at a [restaurant] (6/15)
  6. Example sentence ([amount] = number of sentences requested):
  • I would like to check the status of my bank account. [amount: 50] (~25/50)
  7. Example sentences:
  • Hand me over the canvas please (3/15)
  • How old are you (4/15)
  • What is your name (2/15)

Peculiarities:

Overall the system performs quite well. My goal was to enhance existing datasets with samples that hadn't come to my mind yet, and this worked out very well, depending on the intent. At first I thought that the longer the sentences become, the more the quality would decrease; this assumption was wrong. However, the number of meaningful words does seem to correlate with the quality of the outcome. That's expected, I guess, since the system (at least that's my assumption) doesn't paraphrase on the intent level but on the context level. I also observed, as expected, that the system works pretty well the closer the intents are to the domains of the pretraining dataset.

After examining what happens to the order of words in a sentence, I started to think that the German task would be a bit more complex, since German grammar is a bit more delicate than English grammar. However, the samples that I chose to add to my intents were really good when I picked them by overall quality, meaning that you did not simply shuffle the words into a different order.

My assumption that the system wouldn't perform well on multi-intents was wrong, but once I started to think about the reason, it quickly became clear: if it works on short context ranges, why shouldn't it perform well on multi-intents? However, we need to keep in mind that most of my samples simply concatenated the intents with conjunctions. Excluding the meaningful words to extract patterns went totally haywire; that was expected, since if my assumption of a context-based system is right, I am deliberately removing exactly the relevant words in this case. What's left can be so generic that the system has to try "something". However, I observed that the words could not be totally removed, since the output nevertheless includes them.

This is either because of some estimation that actually led to correct predictions, or it was simply coincidence. The "more is more" approach actually worked out, since I ended up with far more useful examples than expected.

Evaluations

Pipelines

Config_alt:

    pipeline:
      - name: HFTransformersNLP
        model_name: "bert"
        model_weights: "bert-base-cased"
      - name: LanguageModelTokenizer
        # Flag to check whether to split intents
        intent_tokenization_flag: False
        # Symbol on which intent should be split
        intent_split_symbol: "+"
      - name: LanguageModelFeaturizer
      - name: LexicalSyntacticFeaturizer
      - name: CountVectorsFeaturizer
        strip_accents: "unicode"
      - name: CountVectorsFeaturizer
        strip_accents: "unicode"
        analyzer: "char_wb"
        min_ngram: 2
        max_ngram: 15
      - name: DIETClassifier
        intent_classification: True
        entity_recognition: False
        use_masked_language_model: False
        number_of_transformer_layers: 2

Config:

    pipeline:
      - name: SpacyNLP
        model: en_trf_bertbaseuncased_lg_tt
      - name: SpacyTokenizer
        intent_tokenization_flag: true
        intent_split_symbol: "+"
      - name: CountVectorsFeaturizer
        strip_accents: "unicode"
      - name: CountVectorsFeaturizer
        strip_accents: "unicode"
        analyzer: "char_wb"
        min_ngram: 2
        max_ngram: 15
      - name: SpacyFeaturizer
      - name: DIETClassifier
        intent_classification: True
        entity_recognition: False
        use_masked_language_model: False
        BILOU_flag: False
        number_of_transformer_layers: 0

Evaluation setup

5-fold cross-validation with 3 runs
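
In case anyone wants to reproduce the setup: I ran the cross-validation through the standard Rasa CLI, roughly like this (file names are placeholders for my local setup, adjust to yours):

    rasa test nlu --nlu data/nlu.md --config config.yml --cross-validation --folds 5 --runs 3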

| Dataset | Intents | Samples per intent | Semantics |
| --- | --- | --- | --- |
| 1 | 15 | 10 | widely spread |
| 2 | 45 | 50 | close |
| 3 | 55 | 75–100 | both widely spread and close |

Scenario 1

10 out of 10 paraphrased samples were added blindly

Scenario 2

20 samples were paraphrased; on average, 7 of them were carefully chosen and added
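
A quick note on the metric below: macro avg F1 is the unweighted mean of the per-intent F1 scores, so small intents count just as much as large ones:

    macro avg F1 = (1/N) * (F1_intent_1 + ... + F1_intent_N)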

Config_alt.yml

| Scenario | Dataset | Non-augmented macro avg F1 | Augmented macro avg F1 |
| --- | --- | --- | --- |
| 1 | 1 | 0.79 | 0.80 |
| 1 | 2 | 0.84 | 0.84 |
| 1 | 3 | 0.87 | 0.90 |
| 2 | 1 | 0.79 | 0.91 |
| 2 | 2 | 0.84 | 0.92 |
| 2 | 3 | 0.87 | 0.94 |

Config.yml

| Scenario | Dataset | Non-augmented macro avg F1 | Augmented macro avg F1 |
| --- | --- | --- | --- |
| 1 | 1 | 0.85 | 0.87 |
| 1 | 2 | 0.91 | 0.91 |
| 1 | 3 | 0.96 | 0.92 |
| 2 | 1 | 0.85 | 0.94 |
| 2 | 2 | 0.91 | 0.92 |
| 2 | 3 | 0.96 | 0.97 |

Personally, I think the results are not surprising; I am satisfied with them. However, the configs have to be kept in mind: config.yml is one that is currently used in production, while in config_alt.yml I am playing around with the new HFTransformersNLP feature and the like.

Kind regards
Julian
