Entities ending with punctuation

My NLU data looks like this:

- i am looking for [O +](blood_group) blood
- i am looking for [O+](blood_group) blood
- i am looking for [O-](blood_group) blood
- i am looking for [O -](blood_group) blood
- i am looking for [B +](blood_group) blood
- i am looking for [B+](blood_group) blood
- i am looking for [B -](blood_group) blood
- i am looking for [B-](blood_group) blood
- i am looking for [A+](blood_group) blood
- i am looking for [A +](blood_group) blood
- i am looking for [A-](blood_group) blood
- i am looking for [A -](blood_group) blood
- i am looking for [AB-](blood_group) blood
- i am looking for [AB -](blood_group) blood

I get this warning:

Misaligned entity annotation in message ‘i am looking for O - blood’ with intent ‘filter’. Make sure the start and end values of entities in the training data match the token boundaries (e.g. entities don’t include trailing whitespaces or punctuation).

How can I get rid of this warning? I can't change the structure of the entities.

Hi @vishu1994,

The warning is coming from WhitespaceTokenizer.

If you are really concerned about the warning, try a different tokenizer, such as spaCy's:

pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

Or use Rasa X to edit your nlu.md so that the start and end values of the entities match the token boundaries.


Hey, but the thing is, it's not just about the warnings: those entities are not even considered during training.

@vishu1994, yes, you are right. I just did a dry run.

With WhitespaceTokenizer, the entities are not getting detected, because of the misalignment described in the warning you mentioned above.

Then I used spaCy, and the entities are getting detected.

It might be because WhitespaceTokenizer is not meant to handle special characters.
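To make the "token boundaries" part of the warning concrete, here is a rough sketch in plain Python. It assumes a tokenizer that splits on whitespace and drops punctuation at the edges of each chunk. The functions `whitespace_tokens` and `is_aligned` are hypothetical helpers written for this illustration; they are not Rasa's WhitespaceTokenizer or any Rasa API, and the real tokenizer's rules may differ.

```python
import re

# Punctuation stripped from the edges of each chunk (an assumption for this sketch).
PUNCT = "+-.,!?"


def whitespace_tokens(text):
    """Split on whitespace and drop leading/trailing punctuation of each chunk.

    Simplified stand-in for illustration only, not Rasa's implementation.
    Returns (token, start, end) tuples with character offsets into `text`.
    """
    tokens = []
    for match in re.finditer(r"\S+", text):
        chunk = match.group()
        stripped = chunk.strip(PUNCT)
        if not stripped:
            continue  # the chunk was pure punctuation, e.g. a lone "-"
        leading = len(chunk) - len(chunk.lstrip(PUNCT))
        start = match.start() + leading
        tokens.append((stripped, start, start + len(stripped)))
    return tokens


def is_aligned(text, entity_start, entity_end):
    """True if the entity span starts and ends exactly on token boundaries."""
    tokens = whitespace_tokens(text)
    starts = {start for _, start, _ in tokens}
    ends = {end for _, _, end in tokens}
    return entity_start in starts and entity_end in ends


text = "i am looking for O+ blood"
start = text.index("O+")
print(is_aligned(text, start, start + len("O+")))  # False: "+" is not part of any token
print(is_aligned(text, start, start + len("O")))   # True: "O" is a whole token
```

Under these assumptions, the annotation [O+] ends one character past the end of the token "O", so there is no token boundary for the entity to end on. That would explain both the warning and the entity being skipped during training.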


Thanks a lot for taking the time and sharing the necessary details.

Do you have any idea about a tokenizer for the Hindi language?

Actually, I have both Hindi and English data in my NLU file. Should I go for BERT language model support and try multilingual models, which can handle many languages?

Sorry, I don't have any idea. Maybe this Stack Overflow thread will be of some help to you.