Paraphrasing for NLU Data Augmentation [Experimental]

  • With the WhitespaceTokenizer and CountVectorsFeaturizer, we generated more data and the chatbot's classification performance increased significantly.
  • Now that we have pre-trained BERT embeddings, we are debating between two approaches:
  1. Generate more data by fine-tuning a BERT language model, then feed it into training to improve classification performance (sketched below).

  2. Skip generating more training data, because NLU with pre-trained BERT embeddings can already understand a sentence and its similar variants.

What do you think? Which approach is better?
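For concreteness, here is a minimal Python sketch of what approach 1 could look like end to end. `generate_paraphrases` is a hypothetical placeholder (not a Rasa or transformers API) for whatever paraphrasing model you plug in:

```python
from typing import Dict, List

def generate_paraphrases(text: str, n: int = 2) -> List[str]:
    # Hypothetical: replace with a real paraphrasing model
    # (e.g. a fine-tuned GPT-2 or back-translation).
    raise NotImplementedError

def augment(nlu_data: Dict[str, List[str]], n: int = 2) -> Dict[str, List[str]]:
    # For every intent, append n paraphrases of each original example.
    augmented = {}
    for intent, examples in nlu_data.items():
        extra = [p for ex in examples for p in generate_paraphrases(ex, n)]
        augmented[intent] = examples + extra
    return augmented

nlu_data = {"greet": ["hello there", "good morning"],
            "goodbye": ["see you later", "bye bye"]}
# augmented = augment(nlu_data)  # then write back to the NLU training file
```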

@tn.son Regarding your question on BERT, here is one piece of quantitative evidence showing how data augmentation can still be useful with BERT in the pipeline - Paraphrasing for NLU Data Augmentation [Experimental]

Skip generating more training data, because NLU with pre-trained BERT embeddings can already understand a sentence and its similar variants.

Pre-trained BERT is good at understanding similar word contexts, but it is actually not that great at understanding similar sentences. Hence, sentence-level data augmentation still makes sense.
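A quick way to see this for yourself (a rough sketch, assuming a recent transformers version; the model name and sentences are just illustrative): mean-pooled embeddings from vanilla pre-trained BERT tend to assign fairly high cosine similarity even to unrelated sentence pairs, so they are a weak signal for sentence-level similarity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    # Mean-pool BERT's last hidden states into one sentence vector.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc)[0]  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

paraphrase_a = "how do i cancel my subscription"
paraphrase_b = "what is the way to stop my subscription"
unrelated = "the weather is lovely in lisbon today"

# Without sentence-level fine-tuning, both scores are often high,
# so the gap between paraphrases and unrelated pairs can be small.
print(cosine(embed(paraphrase_a), embed(paraphrase_b)))
print(cosine(embed(paraphrase_a), embed(unrelated)))
```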


Thank you very much!

Could I ask some questions?

  • When does your team plan to release this feature in Rasa?
  • Currently this feature only supports English; could you share how to fine-tune GPT-2 for this task? (See the first sketch below.)
  • For other languages like Vietnamese, we don't have anything like the ParaNMT-50M dataset, so I am thinking about building a Vietnamese equivalent by running ParaNMT-50M through Google Translate, or by using back-translation (see the second sketch below). What do you think of this idea?
  • Could you suggest some ways to do this for Vietnamese or other non-English languages?
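On the GPT-2 question: this is not necessarily the exact recipe behind the experimental feature, but a common approach is to format each (sentence, paraphrase) pair as one sequence separated by a delimiter and train with the ordinary causal LM loss. A minimal sketch, assuming a recent transformers version; the `>>>` delimiter, sample pairs, and hyperparameters are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Each training example: "source >>> paraphrase <eos>" (delimiter is arbitrary).
pairs = [
    ("how do i reset my password", "what is the way to change my password"),
    ("i want to book a flight", "please help me reserve a plane ticket"),
]
texts = [f"{src} >>> {tgt}{tokenizer.eos_token}" for src, tgt in pairs]
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# Mask padding out of the loss; GPT-2 shifts labels internally.
labels = enc.input_ids.clone()
labels[enc.attention_mask == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # toy loop; use proper batching/epochs on ParaNMT-scale data
    loss = model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels)[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, prompt with "sentence >>>" and sample continuations.
model.eval()
prompt = tokenizer("where is my order >>>", return_tensors="pt")
out = model.generate(prompt.input_ids, do_sample=True, top_p=0.9,
                     max_length=40, num_return_sequences=3,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```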
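And for the back-translation idea, here is a rough sketch using Marian translation models from the Hugging Face hub. I am assuming the Helsinki-NLP/opus-mt-vi-en and Helsinki-NLP/opus-mt-en-vi checkpoints are available; swap in whatever Vietnamese-English models you trust. Each Vietnamese example is translated to English and back, keeping round-trips that differ from the input:

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

# Assumed checkpoints; any decent vi<->en translation models will do.
vi_en_tok, vi_en = load("Helsinki-NLP/opus-mt-vi-en")
en_vi_tok, en_vi = load("Helsinki-NLP/opus-mt-en-vi")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch)
    return tok.batch_decode(out, skip_special_tokens=True)

def back_translate(vi_sentences):
    english = translate(vi_sentences, vi_en_tok, vi_en)  # vi -> en
    round_trip = translate(english, en_vi_tok, en_vi)    # en -> vi
    # Keep only round-trips that actually differ from the original.
    return [p for s, p in zip(vi_sentences, round_trip) if p != s]

examples = ["tôi muốn đặt một chuyến bay đến hà nội"]
print(back_translate(examples))
```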

Thanks @dakshvar22, this is very helpful. Just a tiny bug: since June 3rd, when transformers 2.11.0 came out, model initialisation throws an error, so maybe change the install instructions to transformers==2.10.0?

@nuszki Thanks for pointing it out. I have pinned the version, and it should now work.
