Paraphrasing for NLU Data Augmentation[Experimental]

  • With the WhitespaceTokenizer and CountVectorsFeaturizer, we generated more data and the chatbot’s classification performance increased significantly.
  • Now that we use pre-trained BERT embeddings, we are debating between two approaches:
  1. Generate more data by fine-tuning a BERT language model, then feed it into training to improve classification performance.

  2. Do not generate more training data, because NLU with pre-trained BERT embeddings can already understand a sentence and sentences similar to it.

What do you think? Which approach is better?

@tn.son Regarding your question on BERT, here is one piece of quantitative evidence that data augmentation can still be useful with BERT in the pipeline: Paraphrasing for NLU Data Augmentation[Experimental] - #4 by JulianGerhard

“No need generating more training data because NLU with pre-trained BERT embedding can understand sentence and similar sentences already.”

Pre-trained BERT is good at understanding similar word contexts, but it is actually not that good at understanding similar sentences. Hence, sentence-level data augmentation still makes sense.


Thank you very much!

Could I ask some questions?

  • When does your team plan to release this feature in Rasa?
  • Currently, this feature only supports English. Could you share how to fine-tune GPT-2 for this task?
  • For other languages such as Vietnamese, we don’t have an equivalent of the ParaNMT-50M dataset. So I am thinking of creating a Vietnamese version of ParaNMT, either by translating the ParaNMT-50M dataset with Google Translate or by back-translation. What do you think of this idea?
  • Could you suggest some ways to do this in Vietnamese or other non-English languages?
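The back-translation idea above can be sketched roughly as follows. This is only an illustration under assumptions: `translate` is a hypothetical callable that you would wire up to whatever MT system you have (Google Translate or anything else), and `back_translate`, `source_lang`, and `pivot_lang` are names made up for this sketch, not part of Rasa or the notebook.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
# No particular translation service or client library is assumed here.
Translator = Callable[[str, str, str], str]


def back_translate(
    sentences: Iterable[str],
    translate: Translator,
    source_lang: str = "vi",
    pivot_lang: str = "en",
) -> List[Tuple[str, str]]:
    """Return (original, paraphrase) pairs produced by round-trip translation."""
    pairs = []
    for sentence in sentences:
        # Translate into the pivot language and back again.
        pivot = translate(sentence, source_lang, pivot_lang)
        round_trip = translate(pivot, pivot_lang, source_lang)
        # A round trip that returns the same wording is useless as a paraphrase.
        if round_trip.strip().lower() != sentence.strip().lower():
            pairs.append((sentence, round_trip))
    return pairs
```

Pairs collected this way would still need manual spot checks, since translation errors end up directly in the paraphrase data.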

Thanks @dakshvar22, this is very helpful. Just one small bug: since June 3rd, when transformers 2.11.0 came out, initialising the model throws an error, so maybe just change the install code to transformers==2.10.0?

@nuszki Thanks for pointing it out. I have pinned the version, and it should work now.
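If you run the code outside the notebook, a small guard like the one below can flag a transformers version that is likely to hit the error. It is a sketch under assumptions: the thread only reports that 2.10.0 works and 2.11.0 fails, so treating every version from 2.11.0 onwards as suspect is a guess, and the helper names (`parse_version`, `is_known_good`, `check_installed`) are made up for illustration.

```python
import re
from importlib import metadata  # available from Python 3.8

# From the thread: 2.10.0 works, 2.11.0 breaks model initialisation.
FIRST_BROKEN = (2, 11, 0)


def parse_version(version: str) -> tuple:
    """Turn '2.10.0' into (2, 10, 0); missing or non-numeric parts become 0."""
    pieces = (version.split(".") + ["0", "0"])[:3]
    parts = []
    for piece in pieces:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_known_good(version: str) -> bool:
    """True if the version is older than the first release reported broken."""
    return parse_version(version) < FIRST_BROKEN


def check_installed(package: str = "transformers") -> str:
    """Describe whether the installed package version looks safe."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if is_known_good(installed):
        return f"{package} {installed} is older than 2.11.0"
    return f"{package} {installed} may hit the error; try {package}==2.10.0"
```

Calling `check_installed()` before loading the model prints nothing by itself; it returns a message you can log or display in a notebook cell.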

Is there a link to download this model? Downloading it from Colab is very time-consuming.

Hello @dakshvar22, is this now available for non-English models?

Hi @dakshvar22,

The download link in the Colab notebook doesn’t work for me. I have the same problem as @indranil180, but even the new link you posted as a reply doesn’t work. Is there any other way or link to get the model?

I am facing the same issue as you. @dakshvar22, could you please help us? We are getting this error when running the model:

    Archive:  model.zip
      End-of-central-directory signature not found.  Either this file is not
      a zipfile, or it constitutes one disk of a multi-part archive.  In the
      latter case the central directory and zipfile comment will be found on
      the last disk(s) of this archive.
    unzip:  cannot find zipfile directory in one of model.zip or
            model.zip.zip, and cannot find model.zip.ZIP, period.

Please help us in resolving this issue.
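The thread does not say why the download is broken. The unzip message only means that the file does not end with a valid zip directory, which typically happens when the download was cut short or when a web page was saved in place of the archive. Both causes are assumptions here, not confirmed by anyone in this thread. A rough check with Python's standard `zipfile` module (the `diagnose_archive` helper is made up for this sketch) could look like this:

```python
import zipfile
from pathlib import Path


def diagnose_archive(path: str) -> str:
    """Give a rough reason why `unzip` might reject the file at `path`."""
    file = Path(path)
    if not file.exists():
        return "missing"
    if zipfile.is_zipfile(str(file)):
        return "ok"
    # Only the first bytes are needed to guess what was actually saved.
    with open(file, "rb") as handle:
        raw = handle.read(512)
    if raw.startswith(b"PK\x03\x04"):
        # Starts like a zip, but the end-of-central-directory record is absent.
        return "truncated"
    text = raw.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        # A web page (for example an error or quota page) was saved instead.
        return "html"
    return "unknown"
```

Running `diagnose_archive("model.zip")` in the notebook before unzipping would tell you whether to retry the download ("truncated") or to look for a different link ("html").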

Hi team, @alexweidauer, please help us. We are having issues using this Colab notebook.

As far as I can see in this thread, the problem has also been going on for a long time.