Learn how to make BERT smaller and faster

Hey everyone,

We’ve released a new blog post about compressing huge neural language models such as BERT: Learn how to make BERT smaller and faster. Edit: there is now another one about pruning BERT in particular.

Even though the blog post is aimed primarily at ML researchers and practitioners, the topic is very much relevant to everyone who wants to use today’s best-performing language models for tasks like intent classification.

If you’ve got any questions, ideas, comments, post them here! :slight_smile:


Will your next post be on pruning etc.? I would love to see how I can implement this directly in my Rasa tests. I have been learning a lot from using Rasa directly, fixing issues, and learning as I go, so it would be cool to see how I can utilize this.

Great post!

Hey @FelixKing, thanks!

I have tried weight pruning and am currently exploring neuron pruning. The results look promising so far, though any inference time improvements are still to be measured. If you’re not afraid of dirty code, you can just watch my branch of the rasa repo for live updates :wink:
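To give a rough idea of what weight pruning does, here is a toy magnitude-pruning sketch in NumPy (illustrative only - this is not the actual code in the branch, and the helper name and shapes are made up for the example):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (toy sketch)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# e.g. prune half the entries of a BERT-sized weight matrix
w = np.random.randn(768, 768)
w_pruned = magnitude_prune(w, sparsity=0.5)
```

Neuron pruning works the same way conceptually, except whole rows/columns of the weight matrix are removed instead of individual entries, which is what actually buys inference-time speedups on dense hardware.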

What is your motivation when it comes to model compression? Do you just want to see if big models can be made faster, or is it something else that’s driving your interest?

Is there a way I can incorporate/configure/enable compression in Rasa (I am thinking config.yml or a combination of such)?

Thank you Cyril

Hi @cyrilthank,

you might want to look at this article from spaCy, in which distillation is properly explained. I've had really good experiences with packaging e.g. a finetuned BERT model as a spaCy model, which can then be imported into Rasa without problems, e.g. by using the following pipeline:

language: de
pipeline:
 - name: SpacyNLP
   case_sensitive: true
   model: de_pytt_bertbasecased_lg_jg
 - name: SpacyTokenizer
 - name: SpacyFeaturizer
 - name: SklearnIntentClassifier

The results are really good and there are currently no performance issues.

Regards Julian

In addition to @JulianGerhard’s advice, I should say that a model pruning blog post is coming soon. I am also making the code for a BERT-based intent classifier (which supports weight/neuron pruning) significantly easier to use.

That being said, Rasa probably won’t officially ship model compression techniques in the near future, but the code I wrote should serve as a good, re-usable example in case you want to apply quantisation, weight pruning or neuron pruning to a model.
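For context, the simplest of those three techniques is post-training quantisation: store the weights as 8-bit integers plus a scale factor. A toy NumPy sketch of symmetric per-tensor int8 quantisation (illustrative only, not Rasa or library code):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: int8 weights plus one float scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# round-to-nearest keeps the reconstruction error within half a quantization step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Going from 4-byte floats to 1-byte integers makes the stored model roughly 4x smaller, and int8 matrix multiplication is typically faster on hardware that supports it.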

What kind of model are you trying to compress, @cyrilthank?

Hi @cyrilthank, @SamS,

I have created a bert_spacy_rasa repo in which I describe a quick dive into the matter. The distillation part, however, still needs to be added, and maybe we could collaborate at this point.
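Regarding the distillation part: the core of it is training the small model against the teacher's softened output distribution. A toy NumPy sketch of that loss term (illustrative only - the function names are made up, and this is not code from the repo or from spaCy):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's soft targets,
    scaled by T^2 as in Hinton et al.'s distillation setup."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
```

In practice this term is combined with the ordinary cross-entropy on the gold labels, so the student learns from both the hard labels and the teacher's "dark knowledge".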

I keep this repo pretty much up-to-date so if you want to share the code @SamS I can use it.

My idea behind this repo was to make life easier for those who want to follow this track and don't want to, or don't have the time to, puzzle everything together.

Feel free to share your thoughts!


Thanks much @JulianGerhard I am going to start hacking through this.

Can you please let me know where I may reach out to you with ‘embarrassingly silly questions’?

Thank you


Hi @cyrilthank,

I think silly questions don't exist - on the contrary - if you have them, others might have them too - so I suggest posting them here!

Usually, I am rather quick! :slight_smile:

Regards Julian

Thanks @JulianGerhard looks like you really want to embarrass me :wink:

But here goes… I see your dataset is for category-wise classification.

Can you please share any similar datasets (not necessarily within the classification domain) in an Entity/Intent recognition ‘framework’ where we may try this?

I'm asking since I assume from the below that this is for a bot which is classifying text, while I am stuck with Entity/Intent recognition dataset issues:

	"text": "<any article you want to get its domain for>"

Hi @cyrilthank,

correct - this dataset is not optimal since it cannot directly be seen as a conversational-AI-related dataset, but it fitted my needs:

  1. It is large enough to be representative for a finetuning evaluation task
  2. It has only a few classes, which means that if a class is seen as an “intent” in rasa, it can easily be integrated, since only minor manual effort is necessary
  3. Imagine the bot as a classifier… you want to know the domain for a given article and the bot utters its domain - seems fair enough for the moment, doesn't it?

One thing should be mentioned here: the current spaCy-packaged models are not capable of directly providing extracted entities. This has to do with the finetuning format for entity transfer learning. I am currently working on this and will update the repo ASAP.

I have solved that by using two models in the same pipeline for rasa.

If this doesn't answer your question, please describe in a bit more detail what you want to achieve!

Regards Julian

Thanks Julian

Sorry, I think I got the words wrong. I didn't at all mean to say your dataset is not optimal.

Frankly, I am still hoping to ‘force-fit’ your work into my domain-specific Entity/Intent recognition scheme to leverage BERT ‘in some form’ :wink:

I see in your latest reply you mentioned you have solved this problem by “…using two models in the same pipeline for rasa”.

Can you please share more about this? Please let me know if I need to create another issue/conversation for it.

Thank you


Hi @cyrilthank,

everything's OK - don't worry! Since most of our own datasets are compliance-secured, I couldn't use one of those. I needed a free one and saw that the DeepSet team used the same GNAD for evaluating their German pretrained BERT - so I decided to “misuse” it.

Of course I can do that. As soon as I realized that I wouldn't be able to use the finetuned BERT-spaCy model in Rasa for e.g. extracting entities like PERSON (in fact, Duckling is currently not able to do that), I thought about how this would be done in general:

  1. Use the SpacyFeaturizer and SpacyEntityExtractor, which would currently be recommended, but which is not possible due to the manual effort needed on the BERT side (as mentioned, I am working on that).

  2. Finetuning the pretrained BERT on any NER dataset, and afterwards converting it into a spaCy-compatible model, is absolutely possible and intended. We can finetune the BERT on both tasks alongside each other. If so, the model contains everything we are going to need to derive entities from it - currently just not with spaCy directly. Instead we could use a CustomBERTEntityExtractor which loads the model that the pipeline has already loaded and does the work that spaCy is currently not “able” to do.

  3. Since 2 seems to be an overhead, at least for the moment, why not do the following:

language: de
pipeline:
 - name: SpacyNLP
   case_sensitive: true
   model: de_pytt_bertbasecased_lg_gnad
 - name: SpacyTokenizer
 - name: SpacyFeaturizer
 - name: SklearnIntentClassifier
 - name: SpacyNLP
   case_sensitive: true
   model: de_core_news_md
 - name: RegexFeaturizer
 - name: CRFEntityExtractor
 - name: DucklingHTTPExtractor
   dimensions: ['time', 'duration', 'email']
   locale: de_DE
   timezone: Europe/Berlin
   url: http://localhost:8001
 - name: SpacyEntityExtractor
   dimensions: ['PER', 'LOC', 'CARDINAL']
 - name: rasa_mod_regex.RegexEntityExtractor
 - name: EntitySynonymMapper

This pipeline will then load and use the features of de_pytt_bertbasecased_lg_gnad for SklearnIntentClassifier, and the features of de_core_news_md for SpacyEntityExtractor.

This is not a neat solution, and it should only be used until there is a smarter way (1, 2), but it works.

It should be mentioned that, of course, you are also able to finetune the de_core_news_md model from spaCy, or train your own.
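Regarding the `rasa_mod_regex.RegexEntityExtractor` entry in the pipeline above: its core logic can be sketched in plain Python like this (illustrative only - the pattern table and function name are made up, and the actual Rasa component wrapper around it is omitted):

```python
import re

# Hypothetical lookup table: entity type -> regex pattern
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "invoice_id": r"\bINV-\d{6}\b",
}

def extract_entities(text: str):
    """Return Rasa-style entity dicts for every regex match in the text."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            entities.append({
                "entity": entity_type,
                "value": m.group(),
                "start": m.start(),
                "end": m.end(),
                "extractor": "RegexEntityExtractor",
            })
    return entities
```

A component like this is handy for entities with a fixed surface form (IDs, emails, codes) where a statistical extractor would be overkill.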

Did that help you?

Regards Julian

Definitely this is helping me think through this better.

Sorry, I am getting spoilt here by your ‘instant replies’, but it is helping me think through this.

Question: from a pipeline/workflow perspective, if I were to fine-tune BERT for Entity Recognition/Intent classification using domain-specific news data (i.e., for example, the data is already picked up from a weather site and we don't need to classify weather news separately), what would the steps look like with spaCy and without spaCy?

Please feel free to respond later and not feel rushed

Thank you for your patient replies


Hi @cyrilthank,

I'll describe that in detail tomorrow morning! I'll edit this post then.

Regards Julian

Thanks a lot @JulianGerhard for your patient replies. Much appreciate your efforts.

Hi @cyrilthank,

this can't be answered in one single reply. I assume that you are familiar with the history of word embeddings and what to do with them, so I will skip that part. If you know what they are capable of, then you should ask yourself: do I need their advantages? I am not really sure about your use case, but I'll try:

  1. Yes, you would be able to finetune BERT on a domain-specific news dataset, if there is enough data for BERT to learn from it. This can be done either by finetuning BERT alone (there are several very good scripts in the HuggingFace repo for that) or by doing it with the spacy-pytorch-transformers library. The latter will allow you to follow the steps described in my repo.

  2. The second question is what you want to do with that finetuned BERT. If you want to use it as a classifier, you have two choices: you could either train/use the finetuned BERT as a classifier directly (e.g. following this one), or you could provide its features to the next algorithm that can use them, e.g. by packaging it with spaCy and using a supervised embeddings config from rasa. If you want to use it for entity extraction, it's pretty much the same: either use it directly or use it in a spaCy pipeline - the caveats of this approach are described in my repo and here.
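To make the "supervised embeddings config" part concrete: in Rasa 1.x that shorthand expands to roughly the following pipeline (per the Rasa docs; whether dense spaCy/BERT features can actually feed the EmbeddingIntentClassifier depends on your Rasa version, so treat this as a starting point rather than a drop-in answer):

```yaml
language: de
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: EntitySynonymMapper
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
```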

If I may cite you:

“(ie for example the data is already picked up from weather site and we dont need to classify weather news separately)”

I don't quite get this part, but it seems you want to pick up that weather data and extract “entities” from it?

I hope that helped!

Regards Julian

you got it absolutely right!

I want to
a. use weather data
b. fine-tune it into BERT using the ‘spaCy route’ you mentioned in point 1 of your answer
c. so that rasa can pick up ‘oh it is hot’ as a ‘weather_intent’ and not as a ‘spice_intent’

Can you please advise based on your extensive experience what may be the steps (from a pipeline/workflow perspective) to achieve that?
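Whichever featurizer ends up in the pipeline, the weather-vs-spice distinction itself comes from labeled NLU examples. A minimal sketch of what that could look like in Rasa's Markdown training data format (the intent names and examples here are just illustrations based on this thread):

```md
## intent:weather_intent
- oh it is hot
- is it going to rain today

## intent:spice_intent
- oh this curry is hot
- how spicy is the vindaloo
```

With enough such examples per intent, the classifier (whether Sklearn on BERT features or an embedding-based one) learns that ‘hot’ is ambiguous and resolves it from context.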

Hi Sam

By “…model pruning blog post…” did you mean this one?

Hi Julian

Sorry for asking.

But is there a way I can ask you for the steps you used, so I can follow similar steps for my use case? Or would that be an infringement?

Thank you