[HELP NEEDED] Misaligned entity annotation in message

(edited: add NLU config)

Rasa version

Rasa Version      :         2.8.16
Minimum Compatible Version: 2.8.9
Rasa SDK Version  :         2.8.3
Rasa X Version    :         None
Python Version    :         3.8.12
Operating System  :         Linux-5.4.0-90-generic-x86_64-with-glibc2.27

Full error message

UserWarning: Misaligned entity annotation in message 'I am .NET Developer' with intent 'int_provide_info'. Make sure the start and end values of entities ([(5, 19, '.NET Developer')]) in the training data match the token boundaries ([(0, 1, 'I'), (2, 4, 'am'), (6, 9, 'NET'), (10, 19, 'Developer')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
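To see why this warning fires, the token boundaries can be reproduced with plain whitespace splitting. Below is a minimal sketch (not Rasa code) that roughly mimics the WhitespaceTokenizer, which also strips leading/trailing punctuation from each token, and flags entity spans that miss every token boundary; `misaligned_entities` is a hypothetical helper name:

```python
import re

PUNCT = ".,!?"

def whitespace_tokens(text):
    """Yield (start, end, token) for whitespace-separated tokens,
    roughly mimicking Rasa's WhitespaceTokenizer, which also strips
    leading/trailing punctuation from each token."""
    for match in re.finditer(r"\S+", text):
        token = match.group()
        stripped = token.strip(PUNCT)
        start = match.start() + token.index(stripped)
        yield start, start + len(stripped), stripped

def misaligned_entities(text, entities):
    """Return entity spans whose start or end is not a token boundary."""
    starts = {s for s, _, _ in whitespace_tokens(text)}
    ends = {e for _, e, _ in whitespace_tokens(text)}
    return [(s, e, text[s:e]) for s, e, _ in entities
            if s not in starts or e not in ends]

print(misaligned_entities("I am .NET Developer", [(5, 19, "JOB")]))
# → [(5, 19, '.NET Developer')]
# The entity starts at 5 (the '.'), but the token 'NET' starts at 6,
# because the tokenizer stripped the leading dot.
```

This matches the token boundaries in the warning above: the annotated span includes the punctuation that the tokenizer drops.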

NLU config

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_weights: "distilbert-base-uncased"
    model_name: "distilbert"

  - name: RegexFeaturizer
    "case_sensitive": False 

  - name: DIETClassifier
    batch_strategy: balanced 
    epochs: 25
    constrain_similarities: true
    
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

After training, my model cannot successfully detect ".NET Developer" as the entity JOB. I have no idea how to fix the above warning; any suggestion is welcome.

@UnknownHow Can you please see this blog: How to Use BERT in Rasa NLU | The Rasa Blog | Rasa. I hope it will help you with the pipeline above.


Thank you,

I think I misunderstood the Rasa docs somewhere. Here is the quote:

DEPRECATED IN 2.1
The HFTransformersNLP is deprecated and will be removed in 3.0. The LanguageModelFeaturizer now implements its behavior.
DEPRECATED IN 2.1
The LanguageModelTokenizer is deprecated and will be removed in a future release. The LanguageModelFeaturizer now implements its behavior. Any tokenizer can be used in its place.

That's why I placed the WhitespaceTokenizer before the LanguageModelFeaturizer.
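For reference, here is how the pipeline from the original post might look with that change (a sketch; per the deprecation note, any tokenizer can stand in for the LanguageModelTokenizer):

```yaml
language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: RegexFeaturizer
    case_sensitive: false
  - name: DIETClassifier
    batch_strategy: balanced
    epochs: 25
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
```

Note that a plain WhitespaceTokenizer still strips surrounding punctuation, so the ".NET Developer" annotation may still need its start offset moved off the dot.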

@UnknownHow Is your issue solved now with my suggestion above, or is there still an issue?

UserWarning: Misaligned entity annotation in message 'Chef de file/artiste sandwich' with intent 'inform'. Make sure the start and end values of entities ([(13, 29, 'artiste sandwich')]) in the training data match the token boundaries ([(0, 4, 'Chef'), (5, 7, 'de'), (8, 20, 'file/artiste'), (21, 29, 'sandwich')]). Common causes:

  1. entities include trailing whitespaces or punctuation
  2. the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation

@nik202 or anyone, can you help regarding this?

@kranthi_419 What is your use case, and can you share the error traceback with a screenshot?

I was getting this warning while training the model. This is my training data:

{"intent": "inform", "text": "Medical Billing Manager,Credentialing and IT Manager", "entities": [{"entity": "title", "value": "Medical Billing Manager", "start": 0, "end": 23}, {"entity": "title", "value": "Credentialing and IT Manager", "start": 24, "end": 52}], "tag": "ALdata_20200418-20200426", "language": "english", "project": "default", "model": "english", "data_type": "common_examples"}

This was the warning: /opt/program/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Medical Billing Manager,Credentialing and IT Manager' with intent 'inform'. Make sure the start and end values of entities ([(0, 23, 'Medical Billing Manager'), (24, 52, 'Credentialing and IT Manager')]) in the training data match the token boundaries ([(0, 7, 'Medical'), (8, 15, 'Billing'), (16, 37, 'Manager,Credentialing'), (38, 41, 'and'), (42, 44, 'IT'), (45, 52, 'Manager')]). Common causes:

  1. entities include trailing whitespaces or punctuation
  2. the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
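In this case the problem is the missing space after the comma: "Manager,Credentialing" becomes a single whitespace token (16, 37), so the first entity's end (23) and the second entity's start (24) fall inside it. One possible preprocessing fix, sketched below, is to insert a space after commas that are glued to the next word and shift the entity offsets to match; `fix_comma_spacing` is a hypothetical helper, not part of Rasa's API:

```python
import re

def fix_comma_spacing(text, entities):
    """Insert a space after commas glued to the next word, shifting
    entity (start, end) offsets to match the new text. `entities` is
    a list of dicts with 'start', 'end', 'entity', 'value' keys."""
    # positions right after each comma that needs a space inserted
    insert_points = [m.end() for m in re.finditer(r",(?=\S)", text)]
    new_text = re.sub(r",(?=\S)", ", ", text)
    new_entities = []
    for ent in entities:
        # each insertion at or before an offset pushes it right by one
        shift_start = sum(1 for p in insert_points if p <= ent["start"])
        shift_end = sum(1 for p in insert_points if p <= ent["end"])
        new_entities.append({**ent,
                             "start": ent["start"] + shift_start,
                             "end": ent["end"] + shift_end})
    return new_text, new_entities

text = "Medical Billing Manager,Credentialing and IT Manager"
ents = [{"entity": "title", "value": "Medical Billing Manager",
         "start": 0, "end": 23},
        {"entity": "title", "value": "Credentialing and IT Manager",
         "start": 24, "end": 52}]
new_text, new_ents = fix_comma_spacing(text, ents)
print(new_text)
# → Medical Billing Manager, Credentialing and IT Manager
```

After this, each entity span again starts and ends on a whitespace token boundary. The alternative is simply fixing the spacing in the source text before annotating.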

Basically, I see there is some entity matching issue here. I'm using Rasa 2.x. Below is my config file:

pipeline:
  - name: HFTransformersNLP
    model_name: bert
    model_weights: bert-base-multilingual-uncased
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer

According to the documentation, the LanguageModelTokenizer uses the tokens produced by the preceding HFTransformersNLP component (the BERT model). But I see that, while training, it is using the WhitespaceTokenizer to tokenize the sentences, which is causing the issue. @nik202
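Per the deprecation notes quoted earlier in the thread, one way to avoid the deprecated HFTransformersNLP/LanguageModelTokenizer pair on Rasa 2.x is to let the LanguageModelFeaturizer load the model directly and use an explicit tokenizer (a sketch, not tested against this exact setup):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-multilingual-uncased
  # downstream components (e.g. DIETClassifier) unchanged
```

Note this alone will not fix the "Manager,Credentialing" misalignment, since a whitespace tokenizer still treats the glued comma as part of one token; the annotation offsets or the text itself need adjusting as well.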