Paraphrasing for NLU Data Augmentation [Experimental]

  • With the WhitespaceTokenizer and CountVectorsFeaturizer, we generated more data and the chatbot's classification performance increased significantly.
  • Now that we have pre-trained BERT embeddings, we are debating between two approaches:
  1. Generate more data by fine-tuning a BERT language model, then feed it into training to improve classification performance (sketched below).

  2. Skip generating more training data, because NLU with pre-trained BERT embeddings can already understand a sentence and its similar variants.

What do you think? Which approach is better?
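For concreteness, here is a minimal Python sketch of what approach 1 could look like end to end. `generate_paraphrases` is a hypothetical placeholder (not a Rasa or transformers API) for whatever paraphrasing model you plug in:

```python
from typing import Dict, List

def generate_paraphrases(text: str, n: int = 2) -> List[str]:
    # Hypothetical: replace with a real paraphrasing model
    # (e.g. a fine-tuned GPT-2 or back-translation).
    raise NotImplementedError

def augment(nlu_data: Dict[str, List[str]], n: int = 2) -> Dict[str, List[str]]:
    # For every intent, append n paraphrases of each original example.
    augmented = {}
    for intent, examples in nlu_data.items():
        extra = [p for ex in examples for p in generate_paraphrases(ex, n)]
        augmented[intent] = examples + extra
    return augmented

nlu_data = {"greet": ["hello there", "good morning"],
            "goodbye": ["see you later", "bye bye"]}
# augmented = augment(nlu_data)  # then write back to the NLU training file
```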

@tn.son Regarding your question on BERT, here is one piece of quantitative evidence showing how data augmentation can still be useful with BERT in the pipeline - Paraphrasing for NLU Data Augmentation [Experimental]

Skip generating more training data, because NLU with pre-trained BERT embeddings can already understand a sentence and its similar variants.

Pre-trained BERT is good at understanding similar word contexts, but it is actually not that great at understanding similar sentences. Hence, sentence-level data augmentation still makes sense.
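A quick way to see this for yourself (a rough sketch, assuming a recent transformers version; the model name and sentences are just illustrative): mean-pooled embeddings from vanilla pre-trained BERT tend to assign fairly high cosine similarity even to unrelated sentence pairs, so they are a weak signal for sentence-level similarity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    # Mean-pool BERT's last hidden states into one sentence vector.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc)[0]  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

paraphrase_a = "how do i cancel my subscription"
paraphrase_b = "what is the way to stop my subscription"
unrelated = "the weather is lovely in lisbon today"

# Without sentence-level fine-tuning, both scores are often high,
# so the gap between paraphrases and unrelated pairs can be small.
print(cosine(embed(paraphrase_a), embed(paraphrase_b)))
print(cosine(embed(paraphrase_a), embed(unrelated)))
```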


Thank you very much!

Could I ask some questions?

  • When does your team plan to release this feature in Rasa?
  • Currently this feature only supports English; could you share how to fine-tune GPT-2 for this task? (See the first sketch below.)
  • For other languages like Vietnamese, we don't have anything like the ParaNMT-50M dataset, so I am thinking about building a Vietnamese equivalent by running ParaNMT-50M through Google Translate, or by using back-translation (see the second sketch below). What do you think of this idea?
  • Could you suggest some ways to do this for Vietnamese or other non-English languages?
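On the GPT-2 question: this is not necessarily the exact recipe behind the experimental feature, but a common approach is to format each (sentence, paraphrase) pair as one sequence separated by a delimiter and train with the ordinary causal LM loss. A minimal sketch, assuming a recent transformers version; the `>>>` delimiter, sample pairs, and hyperparameters are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Each training example: "source >>> paraphrase <eos>" (delimiter is arbitrary).
pairs = [
    ("how do i reset my password", "what is the way to change my password"),
    ("i want to book a flight", "please help me reserve a plane ticket"),
]
texts = [f"{src} >>> {tgt}{tokenizer.eos_token}" for src, tgt in pairs]
enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# Mask padding out of the loss; GPT-2 shifts labels internally.
labels = enc.input_ids.clone()
labels[enc.attention_mask == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # toy loop; use proper batching/epochs on ParaNMT-scale data
    loss = model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels)[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# At inference time, prompt with "sentence >>>" and sample continuations.
model.eval()
prompt = tokenizer("where is my order >>>", return_tensors="pt")
out = model.generate(prompt.input_ids, do_sample=True, top_p=0.9,
                     max_length=40, num_return_sequences=3,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```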
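And for the back-translation idea, here is a rough sketch using Marian translation models from the Hugging Face hub. I am assuming the Helsinki-NLP/opus-mt-vi-en and Helsinki-NLP/opus-mt-en-vi checkpoints are available; swap in whatever Vietnamese-English models you trust. Each Vietnamese example is translated to English and back, keeping round-trips that differ from the input:

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

# Assumed checkpoints; any decent vi<->en translation models will do.
vi_en_tok, vi_en = load("Helsinki-NLP/opus-mt-vi-en")
en_vi_tok, en_vi = load("Helsinki-NLP/opus-mt-en-vi")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch)
    return tok.batch_decode(out, skip_special_tokens=True)

def back_translate(vi_sentences):
    english = translate(vi_sentences, vi_en_tok, vi_en)  # vi -> en
    round_trip = translate(english, en_vi_tok, en_vi)    # en -> vi
    # Keep only round-trips that actually differ from the original.
    return [p for s, p in zip(vi_sentences, round_trip) if p != s]

examples = ["tôi muốn đặt một chuyến bay đến hà nội"]
print(back_translate(examples))
```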

Thanks @dakshvar22, this is very helpful. Just a tiny bug: since June 3rd, when transformers 2.11.0 came out, model initialisation throws an error, so maybe change the install instructions to transformers==2.10.0?

@nuszki Thanks for pointing it out. I have pinned the version, and it should now work.
