Feedback on ConveRT Model + Rasa NLU

Hey! Recently, the folks at PolyAI open-sourced a new sentence encoding model called ConveRT, which is pretrained on a large conversational dataset and hence promises better conversational representations than traditional large language models like BERT. The idea resonates well with what we believe and have been working on internally at Rasa. Since they released their model on TFHub, we decided to build a quick featurizer on top of it to extract representations and feed them to the downstream intent classification models that already exist inside Rasa.

In our internal tests, the model gives a significant boost to intent classification accuracy on multiple datasets. We would love for the community to try it out and share evaluation numbers on their test sets.

We have released the featurizer as part of Rasa 1.5.0. You can try it by installing Rasa:

pip install rasa==1.5.0

The featurizer uses an optional dependency, tensorflow_text. Install it with:

pip install --no-deps tensorflow_text==1.15.1

Now you are ready to use the ConveRTFeaturizer. In the project directory, we recommend a config along these lines:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: ConveRTFeaturizer
  - name: EmbeddingIntentClassifier

This uses the ConveRT model purely as a feature extractor; we do not fine-tune it along with our intent classifier. Please note that this model can only be used for datasets in the English language. Feel free to change the config params of EmbeddingIntentClassifier according to your dataset. It would be great if you could share the evaluation metrics on your test dataset here.
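For example, hyperparameters can be overridden directly under the classifier entry. The values below are placeholders to tune for your own data, not recommendations:

```yaml
language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: ConveRTFeaturizer
  - name: EmbeddingIntentClassifier
    # example overrides -- tune for your own dataset
    epochs: 300
    random_seed: 42
```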


Is this language specific, i.e. does it only work on English vocabulary words? What happens when the corpus is heavily dominated by words outside the English vocabulary?

Hi @dakshvar22,

as far as I can tell, this currently only works on Linux machines; Windows does not seem to be supported yet.

I’ll test it on a Linux machine on Monday and give you feedback on one of our medium-sized, high-quality datasets. Any metric in particular that you are interested in?


@psds01 This is only for the English language for now. Regarding out-of-vocabulary words: if your dataset is dominated by them, this may not help you much, but ConveRT does apply a clever trick for handling OOV words, so I would still suggest giving it a try.

@JulianGerhard That would be great! You can share the F1 scores that the rasa test nlu command writes to the intent classification report file. Or, even better, try a k-fold cross-validation with rasa test nlu -f 5 --cross-validation

Hi @dakshvar22,

I did three different experiments for now:

  1. Using the given pipeline on an English dataset with 5 distinct intents and 162 samples, doing a 5-fold cross-validation, with the following results:
test Accuracy: 0.916 (0.044)
test F1-score: 0.897 (0.048)
test Precision: 0.882 (0.049)

This result is quite good, because the dataset is tricky: many mostly similar sentences lead to different intents.

  2. Using the given pipeline on an English dataset with 52 distinct intents and 3842 samples, doing a 5-fold cross-validation, with the following results:
test Accuracy: 0.934 (0.004)
test F1-score: 0.916 (0.003)
test Precision: 0.911 (0.003)

This is a moderately good result, since it is a domain-specific dataset for which our current baseline is a fine-tuned BERT that performs outstandingly well with your EmbeddingIntentClassifier.

  3. Using the given pipeline and setup on the German version of the dataset used in experiment #2:
test Accuracy: 0.902 (0.004)
test F1-score: 0.902 (0.003)
test Precision: 0.910 (0.003)

I am wondering why this result is so good :smiley:

One additional note: if you combine the ConveRTFeaturizer with either a CountVectorsFeaturizer or a SpacyFeaturizer, sooner or later an error appears complaining about different input shapes. I think this is expected (as far as I read in your docs about the ConveRTFeaturizer), but it has to be addressed in the future; otherwise you would force people to give up on, e.g., domain-specific setups built with n-grams.

If you need anything else, feel free to ask.

Regards Julian

@JulianGerhard Awesome! Thanks for sharing such extensive results. I have some questions about the experiments:

  1. Do you have baseline numbers on this dataset for the supervised embeddings pipeline, which just uses word- and char-level CountVectorsFeaturizers with the EmbeddingIntentClassifier? It would be good to see whether there are any gains.

  2. Can you share the results of your baseline (fine-tuned BERT with EmbeddingIntentClassifier) as well?

  3. This may come as a surprise, but the original ConveRT paper does handle out-of-vocabulary words in a clever fashion using a hashing trick :smiley: . Here, too, it would be interesting to see the test results of the supervised embeddings pipeline.
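To give a flavor of that kind of hashing trick, here is a generic subword-hashing sketch (this is not ConveRT's actual implementation; the bucket count, n-gram size, and function name are made up for illustration):

```python
import hashlib

NUM_BUCKETS = 10000  # illustrative bucket count, not ConveRT's real value

def subword_hash_ids(token, n=3, num_buckets=NUM_BUCKETS):
    """Map a (possibly out-of-vocabulary) token to a set of embedding-bucket
    ids via its character n-grams. Unseen words then still receive shared,
    informative embedding rows instead of a single UNK vector."""
    padded = "<" + token + ">"  # mark word boundaries
    ngrams = {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}
    # md5 (rather than Python's salted hash()) keeps the mapping deterministic
    return sorted(int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) % num_buckets
                  for g in ngrams)
```

Because every token decomposes into n-grams, even a word never seen in training shares buckets with morphologically similar known words.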

You are right, the experimental branch may break when combining any other featurizer with ConveRT at the moment. We wanted to see the best ConveRT can do on its own first.

Regards, Daksh

@JulianGerhard My apologies, I misinterpreted the results of the 3rd experiment. I thought we had sequence encoding instead of just sentence encoding. The results are indeed surprising. Nevertheless, it would be interesting to know the evaluation numbers for a baseline model.

Hi @dakshvar22,

of course, you are welcome at any time. A simple supervised embeddings config resulted in:

  1. dataset:
rasa.nlu.test  - test Accuracy: 0.927 (0.028)
rasa.nlu.test  - test F1-score: 0.908 (0.038)
rasa.nlu.test  - test Precision: 0.893 (0.047)
  2. dataset:
test Accuracy: 0.923 (0.004)
test F1-score: 0.931 (0.003)
test Precision: 0.927 (0.003)
  3. dataset (using “de”):
test Accuracy: 0.900 (0.004)
test F1-score: 0.910 (0.003)
test Precision: 0.913 (0.003)

So yes, there is a gain. There might be better results if I fine-tuned the classifier a bit, but I am short on time, so I need to postpone that.

Regarding your BERT question, the results are:

test Accuracy: 0.978 (0.004)
test F1-score: 0.963 (0.003)
test Precision: 0.962 (0.003)

After my last test the results were even better, because I used a different batch size / sequence length for BERT (128 -> 256), but if I put that into a 5-fold CV, we would still be waiting for the result :smiley: So just consider them around ~1.2% better on every metric.

Anything else that I could try? Other datasets with specific setups?

Regards Julian


@JulianGerhard Thanks for sharing the numbers. Very interesting that the simple supervised embeddings config is on par or even a bit better in some cases.


ConveRT was really cool off the shelf for chitchat/smalltalk. I haven’t tried it as a featurizer yet, but it worked great out of the box as a response selector.

I’d love it if, like spaCy, you could get multiple things out of it. For instance, if you could pass it a list of simple responses, it could extract the best choice.

I imagine using it in a setup where there is one smalltalk or chitchat intent, and when it fires, you respond with the "best_smalltalk_response" for the message. That could be a great way of enabling plenty of smalltalk flexibility without having 10+ intents dedicated to it.
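Roughly, the selection step could be sketched like this, assuming you already have sentence encodings for the incoming message and each candidate response (the encoder call itself is left out, and the function name is hypothetical):

```python
import numpy as np

def best_smalltalk_response(context_vec, response_vecs, responses):
    """Return the candidate response whose encoding is most similar
    (by cosine similarity) to the encoded user message."""
    c = context_vec / np.linalg.norm(context_vec)
    r = response_vecs / np.linalg.norm(response_vecs, axis=1, keepdims=True)
    return responses[int(np.argmax(r @ c))]
```

In practice the response encodings could be precomputed once, so each incoming message only costs one context encoding plus a matrix-vector product.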

Nice work! Really cool to see ConveRT implemented in Rasa.

I noticed that the suggested pipeline includes a WhitespaceTokenizer; that might not be necessary, as ConveRT does its own tokenization internally.

Also, it would be cool to see the model’s other signatures put to use:

  • encode_context and encode_response to handle response selection / smalltalk as @jamesmf mentions. (maybe with the extra context model)
  • encode_sequence and tokenize to power entity/value extraction

Hello all!

Here are my results on a small dataset: 209 intent examples (11 distinct intents) and 80 entity examples (4 distinct entities).

Sklearn pipeline

rasa.nlu.test  - train Accuracy: 0.995 (0.002)
rasa.nlu.test  - train F1-score: 0.995 (0.002)
rasa.nlu.test  - train Precision: 0.995 (0.002)
rasa.nlu.test  - test Accuracy: 0.828 (0.075)
rasa.nlu.test  - test F1-score: 0.806 (0.081)
rasa.nlu.test  - test Precision: 0.813 (0.078)

ConveRT pipeline

rasa.nlu.test  - train Accuracy: 1.000 (0.000)
rasa.nlu.test  - train F1-score: 1.000 (0.000)
rasa.nlu.test  - train Precision: 1.000 (0.000)
rasa.nlu.test  - test Accuracy: 0.923 (0.024)
rasa.nlu.test  - test F1-score: 0.919 (0.025)
rasa.nlu.test  - test Precision: 0.937 (0.020)

@matthen The WhitespaceTokenizer can be omitted if you only want to classify intents; for entity extraction you should include it.


@maulikmadhavi Thanks for sharing the results. Can you also check how the supervised embeddings pipeline does on this dataset?

@matthen Thanks for the suggestions. Those are all definitely planned additions for the near future.

We will add a separate ConveRTTokenizer as an alternative to the WhitespaceTokenizer. Currently, the WhitespaceTokenizer can be used to generate tokens for the other featurizers inside Rasa that users can combine with the ConveRTFeaturizer, e.g. the CountVectorsFeaturizer. Similarly, we have plans to utilize encode_sequence and encode_response very soon.
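For reference, a combined config might look like the sketch below once featurizer mixing works end to end (keep in mind the shape error reported above on the experimental branch; the char n-gram options shown are just one common setup, not a recommendation):

```yaml
language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: ConveRTFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: EmbeddingIntentClassifier
```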


Hi @dakshvar22, here are the supervised pipeline results:

train Accuracy: 1.000 (0.000)
train F1-score: 1.000 (0.000)
train Precision: 1.000 (0.000)
test Accuracy: 0.807 (0.073)
test F1-score: 0.795 (0.070)
test Precision: 0.832 (0.056)

Hi @dakshvar22,

just for your information: I am currently discussing a German version with the PolyAI team. I would like to do some experiments with it; maybe it beats the BERT baseline we currently have, because I think the approach is better suited to the use case.

If you are interested, I’ll share the outcome.

Regards Julian

Any clue on when it will be available for Windows machines?

# pip install rasa==1.5.0

ERROR: Could not find a version that satisfies the requirement rasa==1.5.0 

Do I have to do something special?

I’m facing the same issue, please respond @erohmensing @JulianGerhard @Juste @Emma