Hey,
Recently, the folks at PolyAI open-sourced a new sentence encoding model called ConveRT, which is pretrained on a large conversational dataset and hence promises better conversational representations than traditional large language models like BERT.
The idea resonates very well with what we believe and have been working on internally at Rasa. Since they released their model on TFHub, we decided to build a quick featurizer on top of it to extract representations and use them with the downstream intent classification models that already exist inside Rasa.
In our internal tests, the model does give a significant boost to intent classification accuracy on multiple datasets. We would love the community to try it out and share evaluation numbers on their test sets.
We have released the featurizer as part of Rasa 1.5.0.
You can try it by doing a pip install of Rasa:
pip install rasa==1.5.0
The featurizer uses an optional dependency, tensorflow_text. Install it with:
pip install --no-deps tensorflow_text==1.15.1
Now you are ready to use the ConveRTFeaturizer. In your project directory, we recommend using a config along these lines:
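A minimal sketch (the component names here are assumed from Rasa 1.5.x; check the docs for your version):

language: "en"

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "ConveRTFeaturizer"
  - name: "EmbeddingIntentClassifier"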
This uses the ConveRT model purely as a feature extractor; we do not fine-tune it along with our intent classifier. Please note that this model can only be used for datasets in the English language. Feel free to change the config params of the EmbeddingIntentClassifier according to your dataset. It would be great if you could share the evaluation metrics on your test dataset here.
Is this language-specific?
I.e. does it only work on English vocabulary words? What happens when the corpus is heavily dominated by out-of-English-vocabulary words?
As far as I can tell, this is currently only possible on Linux machines - Windows seems not to be supported yet.
I’ll test it on a Linux machine on Monday and give you feedback on one of our medium-sized datasets of really good quality. Any metric in particular that you are interested in?
@psds01 This is only for the English language for now. Regarding out-of-vocabulary words: if your dataset is dominated by them, this may not help you much, but ConveRT does apply a clever trick for handling OOV words, so I would still suggest giving it a try.
@JulianGerhard That would be great! You can share the F1 score numbers that the rasa test nlu command writes to the intent classification report file.
Or even better, try a k-fold cross-validation with rasa test nlu -f 5 --cross-validation
Using the given pipeline on an English dataset with 5 distinct intents and 162 samples, doing a 5-fold cross-validation, I get the following results:
test Accuracy: 0.916 (0.044)
test F1-score: 0.897 (0.048)
test Precision: 0.882 (0.049)
This result can be considered really good, because the dataset is quite tricky: many very similar sentences lead to different intents.
Using the given pipeline on an English dataset with 52 distinct intents and 3842 samples, doing a 5-fold cross-validation, I get the following results:
test Accuracy: 0.934 (0.004)
test F1-score: 0.916 (0.003)
test Precision: 0.911 (0.003)
This is a moderately good result, since it is a domain-specific dataset for which the current baseline is a fine-tuned BERT, which performs outstandingly well with your EmbeddingIntentClassifier.
Using the given pipeline and setup on the German version of the dataset used in experiment #2:
test Accuracy: 0.902 (0.004)
test F1-score: 0.902 (0.003)
test Precision: 0.910 (0.003)
I am wondering why this result is that good.
One additional note: if you combine the ConveRTFeaturizer with either a CountVectorsFeaturizer or a SpacyFeaturizer, sooner or later an error appears complaining about different input shapes. I think this is expected (as far as I read in your docs about the ConveRTFeaturizer), but it should be addressed in the future, otherwise you would force people to give up on e.g. domain-specific scenarios built with n-grams.
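For reference, the kind of combined pipeline I mean is roughly this (just a sketch of what I tried, not my exact config):

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "ConveRTFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "EmbeddingIntentClassifier"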
@JulianGerhard Awesome! Thanks for sharing such extensive results. I have some questions about the experiments:
Do you have baseline numbers on this dataset for the supervised embeddings pipeline, which just uses word- and char-level CountVectorsFeaturizers and the EmbeddingIntentClassifier? It would be good to see if there are any gains.
Can you share the results of your baseline (fine-tuned BERT with the EmbeddingIntentClassifier) as well?
This may come as a surprise, but the original ConveRT paper does handle out-of-vocabulary words in a clever fashion using a hashing trick. Here, too, it will be interesting to see the test results of the supervised embeddings pipeline.
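For reference, by the supervised embeddings pipeline I mean roughly this kind of config (a sketch with word- and char-level count vectors feeding the EmbeddingIntentClassifier; adjust the hyperparameters as needed):

language: "en"

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "EmbeddingIntentClassifier"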
You are right, the experimental branch may break when using any other featurizer together with ConveRT at the moment. We wanted to look at the best that ConveRT can do on its own for now.
@JulianGerhard My apologies, I misinterpreted the results of the 3rd experiment. I thought we had sequence encoding instead of just sentence encoding. The results are indeed surprising. Nevertheless, it would be interesting to know the evaluation numbers for a baseline model.
Of course, you are welcome at any time. A simple supervised embeddings config resulted in:
First dataset:
test Accuracy: 0.927 (0.028)
test F1-score: 0.908 (0.038)
test Precision: 0.893 (0.047)
Second dataset:
test Accuracy: 0.923 (0.004)
test F1-score: 0.931 (0.003)
test Precision: 0.927 (0.003)
Third dataset (using “de”):
test Accuracy: 0.900 (0.004)
test F1-score: 0.910 (0.003)
test Precision: 0.913 (0.003)
So yes, there is a gain. There might be better results if I fine-tuned the classifier a bit, but I am lacking time, so I need to postpone that.
Regarding your BERT question, the results are:
test Accuracy: 0.978 (0.004)
test F1-score: 0.963 (0.003)
test Precision: 0.962 (0.003)
After my last test the results were even better, because I used a different batch size / sequence length for BERT (128 → 256), but if I put that into a 5-fold CV, we would still be waiting for the result. So just consider them around ~1.2% better for every metric.
Anything else that I could try? Other datasets with specific setups?
ConveRT was really cool off the shelf for chitchat/smalltalk. I haven’t tried it as a featurizer yet, but it worked great out-of-the-box as a response selector.
I’d love it if, like spaCy, you could get multiple things out of it. For instance, if you could pass it a list of simple responses, it could extract the best choice.
I imagine using it in a setup where there is one smalltalk or chitchat intent, and when it fires, you respond with the "best_smalltalk_response" on the message. It could be a great way of enabling plenty of smalltalk flexibility without having 10+ intents dedicated to it.
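Roughly the kind of setup I’m imagining, purely as a sketch (this assumes Rasa’s existing ResponseSelector component could consume the ConveRT features; I haven’t tried this combination):

language: "en"

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "ConveRTFeaturizer"
  - name: "EmbeddingIntentClassifier"
  - name: "ResponseSelector"

with all the smalltalk responses kept under a single retrieval intent, so that whenever it fires, the selector picks the best response.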
@matthen Thanks for the suggestions. Those are all definitely additions planned for the near future.
We will have a separate tokenizer, ConveRTTokenizer, as an alternative to the WhitespaceTokenizer.
Currently, the WhitespaceTokenizer can be used to generate tokens for other featurizers inside Rasa, which users can use alongside the ConveRTFeaturizer, e.g. the CountVectorsFeaturizer.
Similarly, we have plans to utilize encode_sequence and encode_response very soon.
Just for your information: I am currently discussing a German version with the PolyAI team. I would like to run some experiments with it; maybe it beats the BERT baseline we currently have, because I think the approach is better suited for the use case.
If you are interested, I’d be happy to share the outcome.