[HELP NEEDED] Misaligned entity annotation in message

(edited: add NLU config)

Rasa version

Rasa Version      :         2.8.16
Minimum Compatible Version: 2.8.9
Rasa SDK Version  :         2.8.3
Rasa X Version    :         None
Python Version    :         3.8.12
Operating System  :         Linux-5.4.0-90-generic-x86_64-with-glibc2.27

Full error message

UserWarning: Misaligned entity annotation in message 'I am .NET Developer' with intent 'int_provide_info'. Make sure the start and end values of entities ([(5, 19, '.NET Developer')]) in the training data match the token boundaries ([(0, 1, 'I'), (2, 4, 'am'), (6, 9, 'NET'), (10, 19, 'Developer')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
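To see why this warning fires, the token boundaries can be reproduced with plain whitespace splitting. Below is a minimal sketch (not Rasa code) that roughly mimics the WhitespaceTokenizer, which also strips leading/trailing punctuation from each token, and flags entity spans that miss every token boundary; `misaligned_entities` is a hypothetical helper name:

```python
import re

PUNCT = ".,!?"

def whitespace_tokens(text):
    """Yield (start, end, token) for whitespace-separated tokens,
    roughly mimicking Rasa's WhitespaceTokenizer, which also strips
    leading/trailing punctuation from each token."""
    for match in re.finditer(r"\S+", text):
        token = match.group()
        stripped = token.strip(PUNCT)
        start = match.start() + token.index(stripped)
        yield start, start + len(stripped), stripped

def misaligned_entities(text, entities):
    """Return entity spans whose start or end is not a token boundary."""
    starts = {s for s, _, _ in whitespace_tokens(text)}
    ends = {e for _, e, _ in whitespace_tokens(text)}
    return [(s, e, text[s:e]) for s, e, _ in entities
            if s not in starts or e not in ends]

print(misaligned_entities("I am .NET Developer", [(5, 19, "JOB")]))
# → [(5, 19, '.NET Developer')]
# The entity starts at 5 (the '.'), but the token 'NET' starts at 6,
# because the tokenizer stripped the leading dot.
```

This matches the token boundaries in the warning above: the annotated span includes the punctuation that the tokenizer drops.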

NLU config

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_weights: "distilbert-base-uncased"
    model_name: "distilbert"

  - name: RegexFeaturizer
    "case_sensitive": False 

  - name: DIETClassifier
    batch_strategy: balanced 
    epochs: 25
    constrain_similarities: true
    
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

After training, my model cannot successfully detect ".NET Developer" as the entity JOB. I have no idea how to fix the above warning; any suggestion is welcome.

@UnknownHow Can you please see this blog: How to Use BERT in Rasa NLU | The Rasa Blog | Rasa. I hope it will help you with the pipeline above.


Thank you,

I think I misunderstood the Rasa docs somewhere. Here is the quote:

DEPRECATED IN 2.1
The HFTransformersNLP is deprecated and will be removed in 3.0. The LanguageModelFeaturizer now implements its behavior.
DEPRECATED IN 2.1
The LanguageModelTokenizer is deprecated and will be removed in a future release. The LanguageModelFeaturizer now implements its behavior. Any tokenizer can be used in its place.

That's why I placed the WhitespaceTokenizer before the LanguageModelFeaturizer.
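For reference, here is how the pipeline from the original post might look with that change (a sketch; per the deprecation note, any tokenizer can stand in for the LanguageModelTokenizer):

```yaml
language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: RegexFeaturizer
    case_sensitive: false
  - name: DIETClassifier
    batch_strategy: balanced
    epochs: 25
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
```

Note that a plain WhitespaceTokenizer still strips surrounding punctuation, so the ".NET Developer" annotation may still need its start offset moved off the dot.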

@UnknownHow Is your issue solved now with my suggestion above, or is there still an issue?

UserWarning: Misaligned entity annotation in message 'Chef de file/artiste sandwich' with intent 'inform'. Make sure the start and end values of entities ([(13, 29, 'artiste sandwich')]) in the training data match the token boundaries ([(0, 4, 'Chef'), (5, 7, 'de'), (8, 20, 'file/artiste'), (21, 29, 'sandwich')]). Common causes:

  1. entities include trailing whitespaces or punctuation
  2. the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation

@nik202 or anyone, can you help regarding this?

@kranthi_419 What is your use case, and can you share the error traceback with a screenshot?

I was getting this warning while training the model. This is my training data:

{"intent": "inform", "text": "Medical Billing Manager,Credentialing and IT Manager", "entities": [{"entity": "title", "value": "Medical Billing Manager", "start": 0, "end": 23}, {"entity": "title", "value": "Credentialing and IT Manager", "start": 24, "end": 52}], "tag": "ALdata_20200418-20200426", "language": "english", "project": "default", "model": "english", "data_type": "common_examples"}

This was the warning: /opt/program/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Medical Billing Manager,Credentialing and IT Manager' with intent 'inform'. Make sure the start and end values of entities ([(0, 23, 'Medical Billing Manager'), (24, 52, 'Credentialing and IT Manager')]) in the training data match the token boundaries ([(0, 7, 'Medical'), (8, 15, 'Billing'), (16, 37, 'Manager,Credentialing'), (38, 41, 'and'), (42, 44, 'IT'), (45, 52, 'Manager')]). Common causes:

  1. entities include trailing whitespaces or punctuation
  2. the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
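In this case the problem is the missing space after the comma: "Manager,Credentialing" becomes a single whitespace token (16, 37), so the first entity's end (23) and the second entity's start (24) fall inside it. One possible preprocessing fix, sketched below, is to insert a space after commas that are glued to the next word and shift the entity offsets to match; `fix_comma_spacing` is a hypothetical helper, not part of Rasa's API:

```python
import re

def fix_comma_spacing(text, entities):
    """Insert a space after commas glued to the next word, shifting
    entity (start, end) offsets to match the new text. `entities` is
    a list of dicts with 'start', 'end', 'entity', 'value' keys."""
    # positions right after each comma that needs a space inserted
    insert_points = [m.end() for m in re.finditer(r",(?=\S)", text)]
    new_text = re.sub(r",(?=\S)", ", ", text)
    new_entities = []
    for ent in entities:
        # each insertion at or before an offset pushes it right by one
        shift_start = sum(1 for p in insert_points if p <= ent["start"])
        shift_end = sum(1 for p in insert_points if p <= ent["end"])
        new_entities.append({**ent,
                             "start": ent["start"] + shift_start,
                             "end": ent["end"] + shift_end})
    return new_text, new_entities

text = "Medical Billing Manager,Credentialing and IT Manager"
ents = [{"entity": "title", "value": "Medical Billing Manager",
         "start": 0, "end": 23},
        {"entity": "title", "value": "Credentialing and IT Manager",
         "start": 24, "end": 52}]
new_text, new_ents = fix_comma_spacing(text, ents)
print(new_text)
# → Medical Billing Manager, Credentialing and IT Manager
```

After this, each entity span again starts and ends on a whitespace token boundary. The alternative is simply fixing the spacing in the source text before annotating.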

Basically, I see there is some entity matching issue here. I'm using Rasa 2.x. Below is my config file:

pipeline:
  - name: HFTransformersNLP
    model_name: bert
    model_weights: bert-base-multilingual-uncased
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer

According to the documentation, the LanguageModelTokenizer uses the tokens produced by the preceding HFTransformersNLP component (the BERT model). But I see that, while training, it is using the WhitespaceTokenizer to tokenize the sentences, which is causing the issue. @nik202
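Per the deprecation notes quoted earlier in the thread, one way to avoid the deprecated HFTransformersNLP/LanguageModelTokenizer pair on Rasa 2.x is to let the LanguageModelFeaturizer load the model directly and use an explicit tokenizer (a sketch, not tested against this exact setup):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-multilingual-uncased
  # downstream components (e.g. DIETClassifier) unchanged
```

Note this alone will not fix the "Manager,Credentialing" misalignment, since a whitespace tokenizer still treats the glued comma as part of one token; the annotation offsets or the text itself need adjusting as well.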