Using pre-trained BERT for NER as custom component

Hi everyone,

I hope you are staying safe wherever you are.

Recently I wrote a custom component so that I can use a Huggingface NER model for the Swedish language as an entity extractor in Rasa. It works well. However, when I use Rasa X, the PERSON entity does not show up in the front-end, even though I can still see the entities printed out in my terminal. The other entities still show up.

This is what Rasa X shows:

This is what’s in Terminal:

Below I’m attaching my script for the component. As you can see, I have already listed all the entities in dimensions. If you can see where I might have made a mistake, please help me by pointing it out.

import typing
from typing import Any, Dict, List, Text, Optional, Type

from transformers import pipeline
from rasa.nlu.constants import ENTITIES
from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.extractors.extractor import EntityExtractor
from rasa.nlu.utils.hugging_face.hf_transformers import HFTransformersNLP

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata

nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')

class BertEntityExtractor(EntityExtractor):
    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return [HFTransformersNLP]

    defaults = {
        # by default all dimensions recognized by spacy are returned
        # dimensions can be configured to contain an array of strings
        # with the names of the dimensions to filter for
        "dimensions": ["PER", "LOC", "ORG", "OBJ", "EVN", "TME"]
    }

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)

    def process(self, message: Message, **kwargs: Any) -> None:
        # can't use the existing doc here (spacy_doc on the message)
        # because tokens are lower cased which is bad for NER
        #spacy_nlp = kwargs.get("spacy_nlp", None)
        #doc = spacy_nlp(message.text)
        doc = nlp(message.text)
        all_extracted = self.add_extractor_name(self.extract_entities(doc))
        dimensions = self.component_config["dimensions"]
        extracted = BertEntityExtractor.filter_irrelevant_entities(
            all_extracted, dimensions
        )
        message.set(ENTITIES, message.get(ENTITIES, []) + extracted, add_to_output=True)

    @staticmethod
    def extract_entities(doc: "Doc") -> List[Dict[Text, Any]]:

        l = []
        for token in doc:
            if token['word'].startswith('##'):
                l[-1]['word'] += token['word'][2:]
            else:
                l += [ token ]

        print(l)

        entities = [
            {
                "entity": token['entity'],
                "value": token['word'],
                "confidence": token['score'],
            }
            for token in l
        ]
        return entities
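
A note on the entity dictionaries built above: they carry entity, value, and confidence, but no character offsets, whereas the DIETClassifier entry in the printer output further down does include start and end. It is only an assumption (nothing in this thread establishes it) that a front-end may need those offsets to highlight an entity. A minimal sketch of how offsets could be recovered while merging the word pieces is shown here. `merge_wordpieces` and `to_rasa_entities` are hypothetical helper names, and the token dictionaries are assumed to have the same word/entity/score keys the script above relies on.

```python
from typing import Any, Dict, List


def merge_wordpieces(tokens: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Glue '##' continuation pieces onto the preceding token."""
    merged: List[Dict[str, Any]] = []
    for token in tokens:
        if token["word"].startswith("##") and merged:
            # Continuation piece: append it (without '##') to the previous word.
            merged[-1]["word"] += token["word"][2:]
        else:
            # Copy so the caller's token dictionaries are not modified.
            merged.append(dict(token))
    return merged


def to_rasa_entities(text: str, tokens: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build entity dicts that also carry 'start' and 'end' character offsets."""
    entities = []
    cursor = 0  # search from the end of the previous match to keep order
    for token in merge_wordpieces(tokens):
        start = text.find(token["word"], cursor)
        if start == -1:
            # The merged word does not occur verbatim in the text; skip it.
            continue
        end = start + len(token["word"])
        cursor = end
        entities.append(
            {
                "entity": token["entity"],
                "value": token["word"],
                "confidence": float(token["score"]),
                "start": start,
                "end": end,
            }
        )
    return entities


if __name__ == "__main__":
    # Hand-written tokens in the assumed shape, not real model output.
    example_text = "Jag heter Mia och bor i Stockholm"
    example_tokens = [
        {"word": "Mia", "entity": "PER", "score": 0.99},
        {"word": "Stock", "entity": "LOC", "score": 0.99},
        {"word": "##holm", "entity": "LOC", "score": 0.98},
    ]
    for entity in to_rasa_entities(example_text, example_tokens):
        print(entity)
```

Like the original script, this keeps the score of the first word piece as the confidence of the merged word. Searching for the word in the text is a simplification and will skip words that the tokenizer has altered.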

Hi @tyd, this is the discussion I mentioned.

And Tyler also mentioned that I should also include you in this post, @koaning. Hi Vincent! :slight_smile:

Hi Mia Le,

Happy to think along :slight_smile:. A few things come to mind, so I’ll just list them here.

  1. It seems like you are trying out spaCy and BERT. Are you aware that we have a spaCy entity extractor too? In case you’re interested, I’ve written a guide on this topic here. Once spaCy v3 is out you should also have access to BERT-style models via spaCy and we will update our components to be compatible.
  2. When you train a model and then run rasa shell nlu, can you see the entities returned in the JSON blob? I’m asking mainly because you’re printing an intermediate result, not the actual entities that are attached to the message object. Also, could you share your entire config.yml here? I’m wondering if there’s a component that’s accidentally overwriting the entities.
  3. Is there a reason why you prefer using BERT as an entity extractor? Swedish BERT can also be used as a feature for the modeling pipeline. The added features should contribute to the DIET algorithm’s ability to detect entities as well and this might be a whole lot simpler to try out; implementation-wise. You should be able to use our standard LanguageModelFeaturizer.
  4. In the case of detecting names, I’m currently open-sourcing/experimenting with name lists. An issue with detecting names is that models are usually optimized towards a local corpus. You might have a pre-trained model that is very good at detecting Swedish names but very poor at detecting non-Swedish names in the Swedish language. For French language models, it’s been a problem for some of our users in Senegal. It might also be “worth a try” to download the most common names from a government census and just perform string matching on those.
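
To make the string-matching idea in point 4 concrete, here is a minimal sketch in plain Python. It is not a Rasa component, and the names in the example are made up; in practice the list would come from whatever census or name list you download. `build_name_pattern` and `match_names` are hypothetical helper names.

```python
import re
from typing import Any, Dict, Iterable, List, Pattern


def build_name_pattern(names: Iterable[str]) -> Pattern[str]:
    """Compile one case-sensitive regex that matches any name as a whole word."""
    # Longest names first, so "Anna-Karin" is tried before "Anna".
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def match_names(text: str, pattern: Pattern[str]) -> List[Dict[str, Any]]:
    """Return one entity dict per match, with character offsets."""
    return [
        {
            "entity": "PER",
            "value": match.group(0),
            "start": match.start(),
            "end": match.end(),
        }
        for match in pattern.finditer(text)
    ]


if __name__ == "__main__":
    # Illustrative list only; a real list would be much longer.
    pattern = build_name_pattern(["Mia", "Åsa", "Anna", "Anna-Karin"])
    for entity in match_names("Mia och Anna-Karin träffade Åsa i Stockholm", pattern):
        print(entity)
```

The match is deliberately case-sensitive and bounded by `\b`, so a lower-cased "mia" or the "Mia" inside "Miami" is not picked up. Whether that trade-off is right depends on how your users type.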

Hi Vincent, thank you for your answer. Below are my answers:

  1. I am aware that you have a spaCy entity extractor, and I have already used the spaCy v3 nightly to convert the BERT model into a spaCy model. Unfortunately, spacy-transformers did not seem to transfer the NER pipeline for Swedish. That’s why I decided to write a custom component.
  2. You are absolutely right about this. My bad! So that means the only problem now is that the PERSON entity from the custom component does not show up; the rest work fine. Here is my config.yml:
language: sv
pipeline:
- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "KB/bert-base-swedish-cased-ner"
- name: LanguageModelTokenizer
  intent_tokenization_flag: False
  intent_split_symbol: " "
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
  OOV_token: _oov_
  use_shared_vocab: False
- name: KeywordIntentClassifier
  case_sensitive: True
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: DIETClassifier
- name: LanguageModelFeaturizer
- name: test.BertEntityExtractor
- name: CRFEntityExtractor
  features: [
    [
      "low",
      "title",
      "upper",
      "pos"
    ],
    [
      "bias",
      "low",
      "digit",
      "pos",
      "pos2",
      "prefix5",
      "suffix5",
      "text_dense_features"
    ],
    [
      "low",
      "title",
      "upper",
      "pos"
    ]
  ]
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 100
- name: MappingPolicy
- name: FallbackPolicy
  nlu_threshold: 0.37
  core_threshold: 0.3
  fallback_action_name: action_default_fallback
  ambiguity_threshold: 0.1
  3. I prefer using BERT because the SweBERT model I used comes with a pre-trained NER pipeline. The idea is to have a component that detects all of these entities without us having to provide any other examples or training data. As you can see from my config.yml above, I also used LanguageModelFeaturizer.

  4. I also tested with some very, very Swedish names, but it does not detect those either. So I thought there might be some errors in my custom component’s Python script? :thinking:

Thank you so much for your time, Vincent :blush: I assume that you are very busy so I appreciate your help a lot!

Interesting.

I think this might be a usecase for … rasa-nlu-examples!

It’s a project that I maintain and there’s a fancy “printer” component here. It prints information at different parts of the pipeline. It should help debug. Could you put a printer before/after your custom extractor, DIET and the CRF extractor?

Also, just to check, are the entities also properly defined in your domain.yml file? Also, does your nlu.md file have examples of the entities?

Thank you for the printer recommendation. Unfortunately I’m not able to test it right now, but I’ll do that as soon as I can.

Since I assume this works similarly to the SpacyEntityExtractor, I don’t mention these entities in my domain.yml file and also don’t have examples of the entities in nlu.md. Should I do that?

I recall with the RegexEntityExtractor you need at least two examples of the entity in your nlu data before it can kick in and help. There might be something similar happening here. It shouldn’t happen, but this is a “just in case” thing to check.

The printer should give more information though.

I tried the Printer component, and here is the result

after test
intent: {'name': 'celebrity_like', 'confidence': 0.8302780389785767}
entities: [{'entity': 'genre', 'start': 28, 'end': 37, 'value': 'Stockholm', 'extractor': 'DIETClassifier'}, {'entity': 'PER', 'value': 'Mia', 'confidence': 0.9998674392700195, 'extractor': 'BertEntityExtractor'}, {'entity': 'LOC', 'value': 'Stockholm', 'confidence': 0.9988166689872742, 'extractor': 'BertEntityExtractor'}]

Here, test is my test.py; I really need to rename this so it’s less confusing. But it seems to me like it works? Please don’t mind the wrong intent and entities, this is just my test bot :slight_smile:

But the correct entities are supposed to be:

  • LOC: Stockholm
  • PER: Mia

OK. Nice.

Then it seems like there might be a bug in Rasa X. With the confirmation from the printer it feels less likely that there’s a bug in your code.

@tyd, I might want to check this with somebody from the Rasa X team, would you know who to ping?

In the meantime, @mia.le0711 could you confirm that it does work if you use DIET and Swedish BERT as a featurizer?

I confirmed!

Then I propose you use this featurization approach for now and I’ll try to gather internal intel about what might be going awry here.

If you’re willing to run an extra “benchmark”, I would be curious to learn how much better BERT is compared to simply string matching on baby names. I have a small concern that BERT might be overfitting on Swedish-sounding names.

Thank you for helping me out.

About the benchmark, I don’t quite understand what you mean. Could you explain it a bit further?

I’m assuming you’ve got a dataset and that you’re investigating how good “BERT” is at detecting names in your use-case.

This is all fair and well, but I’m curious if you really need BERT. If instead, you use our RegexEntityExtractor to apply a lookup table of baby names you might have a much more light-weight approach.

It’d be interesting to see which approach works out best. That’s what I mean with “benchmark”.
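
A rough sketch of what such a comparison could look like, assuming you have a handful of test sentences where you know which names should be found. `score` is a hypothetical helper, and all the data below is made up purely to show the shape of the inputs; none of it is a real result.

```python
from typing import Dict, List, Set


def score(predictions: List[Set[str]], gold: List[Set[str]]) -> Dict[str, float]:
    """Precision and recall over names, given one set of names per sentence."""
    true_pos = false_pos = false_neg = 0
    for predicted, expected in zip(predictions, gold):
        true_pos += len(predicted & expected)   # names found and expected
        false_pos += len(predicted - expected)  # names found but not expected
        false_neg += len(expected - predicted)  # names expected but missed
    found = true_pos + false_pos
    wanted = true_pos + false_neg
    return {
        "precision": true_pos / found if found else 0.0,
        "recall": true_pos / wanted if wanted else 0.0,
    }


if __name__ == "__main__":
    # Made-up annotations for three sentences; not measured output.
    gold = [{"Mia"}, {"Amadou"}, set()]
    bert_predictions = [{"Mia"}, set(), set()]
    lookup_predictions = [{"Mia"}, {"Amadou"}, {"Stockholm"}]

    print("BERT extractor:", score(bert_predictions, gold))
    print("Name lookup:   ", score(lookup_predictions, gold))
```

Running both extractors over the same sentences and comparing the two dictionaries is the whole benchmark. Including a few non-Swedish names in the test sentences would show whether the overfitting concern holds in your case.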

@mia.le0711 Would you be able to create a bug report for this here?

I understand what you mean now. And actually I haven’t tried the lookup table solution. I will test it out. Thank you again for your help so far!

Yes, sure!