Custom Featurizer for finetuned BERT features based on SpaCy

Hi guys,

I recently played around with Rasa and BERT, since there is some evidence that BERT can handle domain-specific data very well if it is finetuned.

I took the German BERT and, together with the awesome article from spaCy, I was able to finetune BERT on my domain-specific dataset. The evaluation of this finetuned BERT, used directly as a “simple” multi-class classifier without Rasa, showed an average accuracy of 96.8%. Note that, to assess how well the model generalizes, two evaluations were made:

  1. a dataset consisting of real-world-documents only, 75/25 split
  2. a dataset consisting of augmented real-world-documents, 75/25 split

The best evaluation was achieved by using the non-augmented version (which was quite expected).

I then packaged the finetuned model and installed it as a spaCy model which I planned to use in my rasa pipeline.

To be able to compare the performance, my baseline is a spaCy pipeline with a custom kerasNN classifier and a CountVectorsFeaturizer that uses (2, 15) n-grams and a char_wb analyzer, on top of the small de spaCy model. Using the same dataset split as mentioned before, I achieved an average accuracy of 95.6%. This evaluation is based on the command rasa test nlu.
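To make the baseline featurizer more concrete, here is a minimal pure-Python sketch of what a char_wb analyzer with an n-gram range does: every whitespace-separated word is padded with a space on both sides, and character n-grams are only taken inside that padded word. The helper name char_wb_ngrams is made up for illustration, and the small range (2, 3) is used instead of (2, 15) to keep the output readable. It follows the behaviour of scikit-learn's analyzer as I understand it and is not the Rasa implementation.

```python
from collections import Counter
from typing import List


def char_wb_ngrams(text: str, min_n: int, max_n: int) -> List[str]:
    """Character n-grams that never cross word boundaries (hypothetical helper)."""
    ngrams = []
    # Lowercasing mirrors the default of a count vectorizer.
    for word in text.lower().split():
        padded = " " + word + " "  # padding makes word starts and ends visible
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # Word is not longer than n: count it once and stop growing n.
                ngrams.append(padded)
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


grams = char_wb_ngrams("Hey there", 2, 3)
counts = Counter(grams)  # bag-of-n-grams counts, the basis of the feature vector
print(grams)
```

With a range as wide as (2, 15), most words contribute their full padded form as one feature in addition to all shorter fragments.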

Now I changed the pipeline so that the SpacyFeaturizer was used instead of the CountVectorsFeaturizer, loading the finetuned BERT. After running the same pipeline again, my accuracy dropped to ~91%, which was frustrating and unexpected, at least for me.

After talking to a friend, we figured that the implementation of the SpacyFeaturizer may not be optimal for using the features BERT is actually able to provide. Taking a look at the implementation of SpacyFeaturizer:

def features_for_doc(doc: "Doc") -> np.ndarray:
    """Feature vector for a single document / sentence."""
    return doc.vector

What does doc.vector do? It just takes the average of all the representations given by the language model for each token in the input text.

This is probably sub-optimal for BERT. Instead, usually only the representation of the [CLS] token is used for classifying the given text with a BERT model.

So, one solution is to write a custom spaCy featurizer component that does exactly this. We can also make this component a bit more complex by exactly implementing the architecture used for finetuning BERT (i.e. “softmax_pooler_output”), although without its classifier (i.e. softmax) layer(s).
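The difference between the two pooling strategies can be sketched without any model at all. The snippet below is only an illustration on toy numbers: mean_pool and cls_pool are hypothetical helper names, and it assumes that the [CLS] token sits at position 0 of the per-token output, as is the BERT convention.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average over all token vectors, as described for doc.vector above."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Use only the first vector, assuming position 0 holds the [CLS] token."""
    return list(token_vectors[0])


# Toy hidden states: 3 tokens ([CLS], "hello", [SEP]) with hidden size 2.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
print(mean_pool(hidden))
print(cls_pool(hidden))
```

A custom featurizer would differ from the existing one only in which of these two functions it applies to the per-token output.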

I have noticed the awesome bert-as-a-service, which is actually able to provide those features, but currently it is only stable with the BERT models from Google and has some problems with finetuned ones. In addition, one might not want to embed a third-party API dependency in one's config as Gao did.

My question:

Has anyone already tried this, or would anyone be interested in doing it collaboratively? I think this is a non-intrusive, Rasa-compatible way of using the benefits of BERT in Rasa without impacting the runtime performance, if it works.

Regards Julian

Yes this sounds like a great idea. Is there any way to access the representation for the CLS token from the spacy doc? If so we could use a configuration key to specify how a doc should be transformed into its feature representation.

In the long run we should rather have a separate BERT component (or a better abstraction over BERT / spacy / …). I think @SamS has been working on this as well, what are your thoughts?

Hey @JulianGerhard, you’re right: averaging the output representations is suboptimal and only the [CLS] token’s representation should be used.

My question would be: Why not take BERT itself as a classifier instead of feeding its outputs into an external classifier? Indeed, it may be cheaper and more flexible to only re-train the small classifier if need be, but classifying directly with BERT is, I think, a solid starting position :slight_smile:

There is a working (but still work-in-progress and dirty) implementation of BERT as an intent classifier for Rasa in my project’s branch. I adapted it to be more Rasa-like and it’s based on this BERT-based intent classifier. However, the computational graph is the same as the “base” variant from Google and it’s easy to use with any fine-tuned weights. Note that the code also contains various compression techniques because my project is about speeding up BERT’s inference (see this blog post).

Hi @tmbo, Hi @SamS,

thanks for your kind response. In the meantime I did some further research that might be interesting for you. Following my thoughts about doc.vector I tried several representations:

doc.vector

doc.tensor.mean(axis=0)

doc._.pytt_last_hidden_state.mean(axis=0)

(which was actually recommended in this article from spacy)

With a “simple” logistic regression I achieved ~97% accuracy, so the cause of the huge drop had to lie somewhere else. After some investigation it turned out to be something quite simple: if I finetune a cased BERT with some effort and want to use it, I should probably tell my Rasa config to work case-sensitively :smiley:
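Why this matters can be sketched with a toy vocabulary. The example assumes that a case-insensitive setup lowercases the text before it reaches the model. The vocabulary and the lookup function are made up, and a real cased BERT would split unknown words into WordPiece sub-tokens rather than map them straight to [UNK], but the mismatch is the same in spirit.

```python
from typing import List

# Hypothetical cased vocabulary; a real cased BERT vocabulary is far larger
# and consists of WordPiece sub-tokens.
CASED_VOCAB = {"Rechnung", "Kunde", "bezahlen"}


def lookup(tokens: List[str], case_sensitive: bool) -> List[str]:
    """Map tokens to vocabulary entries, optionally lowercasing the input first."""
    if not case_sensitive:
        tokens = [token.lower() for token in tokens]
    return [token if token in CASED_VOCAB else "[UNK]" for token in tokens]


tokens = ["Kunde", "Rechnung", "bezahlen"]
print(lookup(tokens, case_sensitive=True))
print(lookup(tokens, case_sensitive=False))
```

German nouns are capitalized, so lowercasing the input hides exactly the tokens a cased German model was finetuned on.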

Now, to briefly recap what was done:

  1. Used a German cased pretrained BERT

  2. Finetuned that BERT with domain specific data

  3. Converted the finetuned model into a spaCy compatible model

  4. After ~45 evaluation steps the following config turned out to be best:

    language: de_pytt_bertbasecased_lg_jg
    pipeline:
    - name: SpacyNLP
      case_sensitive: 1
    - name: SpacyTokenizer
    - name: SpacyFeaturizer
    - name: SklearnIntentClassifier

with ~98.9% accuracy tested on unseen data (I assume really good generalisation here). However, it might be a good idea to investigate the CLS representation further. I will do that in the next few days and, in addition, try to improve the current baseline even more.

The performance of this pipeline is good: ~650ms per request.

Indeed, I considered your suggestion @SamS and tried it a few days ago with more or less the same setup outside of Rasa, to “simply” evaluate the performance of the finetuned BERT used directly as a classifier. The results were slightly worse than with the current architecture, but I didn't have the time to exactly replicate my evaluations in the new setup.

One thought about why I like the current spaCy/rasa combination:

My next step will be to also finetune BERT on a “casual” NER task and a domain NER task, which, following my current pipeline, would enable the model to provide NER features as well. Otherwise I would have to implement a BERTClassifier component and a BERTEntityExtractor component for Rasa. I could actually do that, but if spaCy can handle it well, why not?

If you are interested in a collaboration, want to be updated on the results, or just want to comment on my thoughts, feel free :slight_smile:

Regards Julian

Hi everyone following this thread,

I’ve covered everything I talked about in a new repo bert_spacy_rasa.

The purpose of this repo is to provide a quick dive into the matter and to create a platform for evaluating those transformer approaches with techniques that are as close to Rasa as possible.

If someone is interested, feel free to share your thoughts!

Regards Julian

Hi @JulianGerhard, @SamS, @tmbo, I went through @JulianGerhard's repo and found it interesting. I am also trying to do intent classification with BERT embeddings. What I am using is the pretrained model from spaCy, “en_pytt_bertbaseuncased_lg”.

The process I went through to integrate it with spaCy and NLU:

-> Downloaded the model with “python -m spacy download en_pytt_bertbaseuncased_lg”

-> Linked the model with “python -m spacy link en_pytt_bertbaseuncased_lg bert”

-> Then referenced it in the Rasa NLU config as:

    language: en
    pipeline:
    - name: SpacyNLP
      case_sensitive: 1
      model: bert
    - name: SpacyTokenizer
    - name: SpacyFeaturizer
    - name: SklearnIntentClassifier

And I used nlu.md as the Rasa training data:

    ## intent:greet
    - hey
    - hello
    - good morning
    - good evening
    - hey there

    ## intent:project_usecase

    ## intent:client_info

When I ran the command rasa train nlu, I ended up with the following error:

2019-09-11 18:44:13 INFO rasa.nlu.model - Starting to train component SpacyNLP
Traceback (most recent call last):
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\vighnesh.paramasivam\AppData\Local\Continuum\anaconda2\envs\python36\Scripts\rasa.exe\__main__.py", line 9, in <module>
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\__main__.py", line 76, in main
    cmdline_arguments.func(cmdline_arguments)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\cli\train.py", line 136, in train_nlu
    fixed_model_name=args.fixed_model_name,
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\train.py", line 384, in train_nlu
    _train_nlu_async(config, nlu_data, output, train_path, fixed_model_name)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\asyncio\base_events.py", line 468, in run_until_complete
    return future.result()
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\train.py", line 414, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\train.py", line 443, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\nlu\train.py", line 80, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\nlu\model.py", line 195, in train
    updates = component.train(working_data, self.config, **context)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\nlu\utils\spacy_utils.py", line 155, in train
    attribute_docs = self.docs_for_training_data(training_data)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\nlu\utils\spacy_utils.py", line 146, in docs_for_training_data
    docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\rasa\nlu\utils\spacy_utils.py", line 146, in <listcomp>
    docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\spacy\language.py", line 751, in pipe
    for doc in docs:
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\spacy_pytorch_transformers\pipeline\tok2vec.py", line 120, in pipe
    outputs = self.predict(docs)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\spacy_pytorch_transformers\pipeline\tok2vec.py", line 162, in predict
    return self.model.predict(docs)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\thinc\neural\_classes\model.py", line 133, in predict
    y, _ = self.begin_update(X, drop=None)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\spacy_pytorch_transformers\model_registry.py", line 347, in sentence_fwd
    acts, bp_acts = layer.begin_update(sents, drop=drop)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "c:\users\vighnesh.paramasivam\appdata\local\continuum\anaconda2\envs\python36\lib\site-packages\spacy_pytorch_transformers\model_registry.py", line 236, in get_features_forward
    assert isinstance(sents[0], Span)
IndexError: list index out of range

Am I using the config correctly, or am I deviating somewhere? Please share your comments.

Hi @Vighnesh

I am aware of this problem and am currently working on it.

I'll let you know as soon as I have figured out what causes it.

Regards Julian

Sure, thanks. I am awaiting your reply and am eager to know whether there are any other alternatives for including BERT embedding features in Rasa NLU.

I would also like to know whether the config.yml I mentioned in the previous post is correct.

Hi @Vighnesh,

I was able to figure out what the problem is. @dakshvar22 changed (several days ago) the code for the pipeline element SpacyNLP - I assume to support the newly added ResponseSelector. The element now provides:

provides = ["spacy_doc", "spacy_nlp", "intent_spacy_doc", "response_spacy_doc"]

The training data now usually consists of a triple including:

  • text
  • intent
  • response

Some of the methods in SpacyNLP now include:

for attribute in MESSAGE_ATTRIBUTES:

One of them is:

def docs_for_training_data(
    self, training_data: TrainingData
) -> Dict[Text, List[Any]]:

which calls docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)] for every one of the attributes mentioned above. The problem is that a “text” attribute can now be an empty string. spaCy itself can handle this, but the spacy-pytorch-transformers library can't, and I doubt that this will change in the near future.
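The failure pattern from the traceback (indexing sents[0] when there are no sentences) can be reproduced in isolation. The following is only a sketch with hypothetical stand-in functions, not the library's code: an empty text yields an empty sentence list, and any code that assumes a first sentence then raises IndexError.

```python
from typing import List


def split_sentences(text: str) -> List[str]:
    """Toy sentence splitter: an empty text yields no sentences at all."""
    return [part.strip() for part in text.split(".") if part.strip()]


def first_sentence_length(sents: List[str]) -> int:
    """Stand-in for code that assumes at least one sentence exists."""
    return len(sents[0])  # raises IndexError when sents is empty


for text in ["hello there.", ""]:
    sents = split_sentences(text)
    try:
        print(repr(text), first_sentence_length(sents))
    except IndexError as err:
        print(repr(text), "failed:", err)
```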

To overcome the problem, I slightly modified the methods:

def docs_for_training_data(
    self, training_data: TrainingData
) -> Dict[Text, List[Any]]:

    attribute_docs = {}
    for attribute in MESSAGE_ATTRIBUTES:
        texts = []
        for intent_example in training_data.intent_examples:
            if len(self.get_text(intent_example, attribute)):
                texts.append(self.get_text(intent_example, attribute))
        logger.info(self.get_text(training_data.intent_examples[0], attribute))
        docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
        attribute_docs[attribute] = docs
    return attribute_docs

def train(
    self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
) -> None:

    attribute_docs = self.docs_for_training_data(training_data)
    
    for attribute in MESSAGE_ATTRIBUTES:
        if len(attribute_docs[attribute]) > 0:
            for idx, example in enumerate(training_data.training_examples):
                example_attribute_doc = attribute_docs[attribute][idx]
                if len(example_attribute_doc):
                    # If length is 0, that means the initial text feature was None and was replaced by ''
                    # in preprocess method
                    example.set(
                        MESSAGE_SPACY_FEATURES_NAMES[attribute], example_attribute_doc
                    )

This way, a classifier that follows in the Rasa pipeline is only trained on documents that actually exist for a given attribute. If, for example, no response is defined, that attribute is skipped.

@dakshvar22 Do you think that this might lead to follow-up problems?

Regards Julian

Hi @JulianGerhard,

This was actually a tricky case in the implementation. The reason why a non-existent attribute was given an empty string is that spaCy doesn't accept None as the input string. Since we are processing spaCy docs for training examples in batch mode with self.nlp.pipe(texts, batch_size=50) (which is much faster), I replaced all None attributes with empty strings. It would be tedious and messy to filter out examples with a None value for an attribute and then merge an empty doc back in for them. The way you have implemented it currently has a small bug:

docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
attribute_docs[attribute] = docs

docs will be the spacy docs for a filtered set of texts, which means the order of the spacy docs of training examples would now be different from the order of training examples in training_data.training_examples . This would cause a problem here -

for idx, example in enumerate(training_data.training_examples):
    example_attribute_doc = attribute_docs[attribute][idx]
    if len(example_attribute_doc):
        # If length is 0, that means the initial text feature was None and was replaced by ''
        # in preprocess method
        example.set(
            MESSAGE_SPACY_FEATURES_NAMES[attribute], example_attribute_doc
        )

attribute_docs[attribute][idx] does not correspond to the correct spacy doc for training example at idx index inside training_data.training_examples.
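One way to keep the alignment is the filter-and-merge approach mentioned above: remember the original index of every non-empty text, run only those through the pipe, and write the results back to their original positions. The snippet below is a minimal sketch under assumptions. process_keeping_order and fake_pipe are hypothetical names, strings stand in for spaCy docs, and None stands in for whatever placeholder an empty example should get.

```python
from typing import Callable, Iterable, List, Optional


def process_keeping_order(
    texts: List[Optional[str]],
    pipe: Callable[[Iterable[str]], Iterable[str]],
) -> List[Optional[str]]:
    """Run `pipe` on non-empty texts only, but return one result per input."""
    # Remember where each non-empty text came from.
    kept_indices = [i for i, text in enumerate(texts) if text]
    processed = pipe(texts[i] for i in kept_indices)

    results: List[Optional[str]] = [None] * len(texts)  # None marks "no doc"
    for i, doc in zip(kept_indices, processed):
        results[i] = doc
    return results


def fake_pipe(stream: Iterable[str]) -> Iterable[str]:
    """Hypothetical stand-in for nlp.pipe that fails on empty input."""
    for text in stream:
        if not text:
            raise IndexError("list index out of range")
        yield text.upper()


docs = process_keeping_order(["hey", "", "good morning", None], fake_pipe)
print(docs)
```

Because the result list has the same length as the input, attribute_docs[attribute][idx] would again refer to the example at index idx.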

I haven’t looked at spacy-pytorch-transformers library myself, but do you have any other idea to avoid this?

Hi @dakshvar22,

okay, got it. So you mean that this currently only works because the list of response examples is empty, and as soon as there is content, I would break the order?

Since this is kind of a showstopper for one of my bots, which relies on very high accuracy and was therefore trained with BERT embeddings, I thought about several scenarios to avoid this behaviour.

The training process actually fails here:

 File "c:\users\\appdata\local\programs\python\python36\lib\site-packages\rasa\nlu\utils\spacy_utils.py", line 145, in <listcomp>
    docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
  File "c:\users\\appdata\local\programs\python\python36\lib\site-packages\spacy\language.py", line 752, in pipe
    for doc in docs:
  File "pipes.pyx", line 941, in pipe
  File "c:\users\\appdata\local\programs\python\python36\lib\site-packages\spacy\util.py", line 463, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))

because the pipe actually used resides in the transformers library and is defined as:

def pipe(self, stream, batch_size=128):
    """Process Doc objects as a stream and assign the extracted features.

    stream (iterable): A stream of Doc objects.
    batch_size (int): The number of texts to buffer.
    YIELDS (spacy.tokens.Doc): Processed Docs in order.
    """
    for docs in minibatch(stream, size=batch_size):
        docs = list(docs)
        outputs = self.predict(docs)
        self.set_annotations(docs, outputs)
        for doc in docs:
            yield doc

So one way would maybe be to handle things here. Any ideas?

Regards and thanks for your help

@JulianGerhard Yes, I think that would be the best place to handle the case of empty doc:

docs = list(docs)
outputs = self.predict(docs)
self.set_annotations(docs, outputs)

But this would require you to make changes in the spacy-transformers library. Is that ideal? Alternatively, would you be up for submitting a PR to Rasa that handles empty docs in the way I mentioned before, or in any other way (I am open to discussion)? That would be helpful to the community too.
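A minimal sketch of such a guard inside the pipe, under assumptions: predict is a hypothetical model call that fails on empty documents, minibatch is a simplified stand-in for the library's batching helper, and strings stand in for Doc objects. Empty docs are passed through in order but never reach predict.

```python
import itertools
from typing import Iterable, Iterator, List, Optional, Tuple


def minibatch(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` items (simplified batching helper)."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def predict(docs: List[str]) -> List[int]:
    """Hypothetical model call that cannot handle empty documents."""
    if any(len(doc) == 0 for doc in docs):
        raise IndexError("list index out of range")
    return [len(doc) for doc in docs]


def pipe(stream: Iterable[str], batch_size: int = 128) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (doc, output) in input order; empty docs get None as output."""
    for docs in minibatch(stream, batch_size):
        non_empty = [doc for doc in docs if len(doc) > 0]
        outputs = iter(predict(non_empty)) if non_empty else iter(())
        for doc in docs:
            yield doc, (next(outputs) if len(doc) > 0 else None)


print(list(pipe(["hey", "", "hello"], batch_size=2)))
```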

Thanks

@dakshvar22 @JulianGerhard I just have a small query: I have seen that bert_embeddings and bert_intent_classifier were included in the “rasa-nlu 0.15.1” version. Would that help in this case?

Hi @dakshvar22

since this is urgent for me, I would suggest that I think about a stable solution, propose it to you and then open a PR. Does that work for you?

Regards Julian

Sounds good. Feel free to open an issue on the GitHub repo and tag me over there. Thanks.

Hi @Vighnesh,

this issue should be fixed from version 1.3.4 on.

Regards Julian

Hi everyone,

I updated the repo today. Two new features were added:

  • Support for DistilBERT and other transformer-based architectures
  • Support for NER on a transformer-based model

Using DistilBERT, for example, gave very good results for intent detection with an entirely reasonable amount of training time. You can now even add your own custom entities and simply use the hybrid of spaCy and Rasa to auto-extract those entities, e.g. for use in slots.

Regards Julian