Hi guys,
I recently played around with Rasa and BERT, since there is some evidence that BERT can handle domain-specific data very well if it is finetuned.
I took the German BERT and, together with the awesome article from spaCy, I was able to finetune it on my domain-specific dataset. Used directly as a “simple” multi-class classifier without Rasa, this finetuned BERT reached an average accuracy of 96.8%. Note that two evaluations were made to assess how well the model generalizes:
- a dataset consisting of real-world-documents only, 75/25 split
- a dataset consisting of augmented real-world-documents, 75/25 split
The best result was achieved with the non-augmented version (which was expected).
I then packaged the finetuned model and installed it as a spaCy model, which I planned to use in my Rasa pipeline.
To be able to compare the performance, my baseline is a spaCy pipeline with a custom kerasNN classifier and a CountVectorsFeaturizer that uses n-grams of length 2 to 15 and a char_wb analyzer, on top of the small German (de) spaCy model. Using the same dataset split as mentioned before, I achieved an average accuracy of 95.6%. This evaluation is based on the command rasa test nlu.
I then changed the pipeline to use the SpacyFeaturizer instead of the CountVectorsFeaturizer, loading the finetuned BERT. After rerunning the same pipeline, my accuracy dropped to ~91%, which was frustrating and unexpected, at least for me.
After talking to a friend, we figured that the implementation of the SpacyFeaturizer may not make optimal use of the features BERT is actually able to provide. Taking a look at the implementation of SpacyFeaturizer:
def features_for_doc(doc: "Doc") -> np.ndarray:
    """Feature vector for a single document / sentence."""
    return doc.vector
What does doc.vector do? It simply takes the average of the representations the language model produces for each token in the input text.
This is probably suboptimal for BERT. Usually, only the representation of the [CLS] token is used to classify the given text with a BERT model.
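To make the difference concrete, here is a minimal sketch of the two pooling strategies on a toy matrix of token representations. The function names and the toy numbers are made up for illustration, and it assumes the [CLS] token sits in row 0, as it does in a standard BERT input sequence.

```python
import numpy as np


def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Average over all token rows (what doc.vector does by default)."""
    return token_vectors.mean(axis=0)


def cls_pool(token_vectors: np.ndarray, cls_index: int = 0) -> np.ndarray:
    """Use only the row of the [CLS] token (assumed to be row 0)."""
    return token_vectors[cls_index]


# Toy example: 3 tokens ([CLS] plus two word pieces), hidden size 2.
# The values are invented and do not come from a real model.
toy = np.array([
    [1.0, 0.0],  # [CLS]
    [0.0, 2.0],
    [2.0, 4.0],
])

mean_vec = mean_pool(toy)  # mixes all three rows
cls_vec = cls_pool(toy)    # keeps only the [CLS] row
```

With a finetuned BERT, the classification head was trained on the [CLS] representation, so averaging it with all other token rows may dilute exactly the signal the finetuning put there.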
So, one solution is to write a custom spaCy featurizer component that does exactly this. We could also make this component a bit more complex by replicating the architecture used for finetuning BERT (i.e. “softmax_pooler_output”), although without its classifier (i.e. softmax) layer(s).
I have noticed the awesome bert-as-service, which is actually able to provide those features, but currently it is only stable with the BERT models from Google and has some problems with finetuned ones. In addition, one might not want to embed a third-party API dependency in one's config, as Gao did.
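A rough sketch of what such a replacement for features_for_doc could look like follows. It is not tested against a real Rasa pipeline. It assumes that the spaCy model exposes the per-word-piece hidden states through an extension attribute (the default name pytt_last_hidden_state is an assumption based on spacy-pytorch-transformers and may differ in your version) and that row 0 corresponds to [CLS] for a single-sentence message. The function name and the stand-in doc object are hypothetical.

```python
from types import SimpleNamespace

import numpy as np


def features_for_doc_cls(doc, attr: str = "pytt_last_hidden_state") -> np.ndarray:
    """Feature vector for a single document, taken from the [CLS] row.

    `attr` is the name of the extension attribute holding the last hidden
    states; the default is an assumption, adjust it to your spaCy model.
    """
    hidden = getattr(doc._, attr, None)
    if hidden is None:
        # No transformer output available: fall back to the averaged vector.
        return np.asarray(doc.vector)
    # Row 0 is assumed to be the [CLS] token.
    return np.asarray(hidden)[0]


# Stand-in for a spaCy Doc so the sketch runs without a model installed.
fake_doc = SimpleNamespace(
    _=SimpleNamespace(pytt_last_hidden_state=np.array([[0.5, 0.1], [0.9, 0.3]])),
    vector=np.array([0.7, 0.2]),
)
plain_doc = SimpleNamespace(_=SimpleNamespace(), vector=np.array([0.7, 0.2]))

cls_features = features_for_doc_cls(fake_doc)        # [CLS] row
fallback_features = features_for_doc_cls(plain_doc)  # averaged vector
```

The fallback to doc.vector keeps the function usable with ordinary spaCy models, so the same component could be dropped into either pipeline for comparison.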
My question:
Has anyone already tried this, or would anyone be interested in doing it collaboratively? I think this is a non-intrusive, Rasa-compatible way of using the benefits of BERT in Rasa without impacting the runtime performance, if it works.
Regards Julian