After using SpacyTokenizer: Misaligned entity annotation error when using CRFEntityExtractor

bayesianwannabe · February 24, 2020, 1:18pm

Hey all,

I have already annotated a bunch of entities, but I seem to have a problem with the entity extraction part.

Example:

2020-02-24 09:59:54 WARNING rasa.nlu.extractors.crf_entity_extractor - Misaligned entity annotation for ‘10/02’ in sentence ‘why my 10/02 transaction got this?’ with intent ‘complain_transaction’. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don’t include trailing whitespaces or punctuation).

As I am training with a Brazilian Portuguese tokenizer, I specified this in config.yml:

language : "pt-br"
pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "CRFEntityExtractor"

Also, in a simple sanity check using the spaCy API with its Portuguese tokenizer, I got the expected outputs:

import spacy
nlp = spacy.load('pt_core_news_sm')
test = nlp('why my 10/02 transaction got this?')
print(test[2].text, test[2].idx)

10/02 7

print(test[6].text, test[6].idx)

? 33

So the main point is: with the Portuguese model, the spaCy tokenizer tokenizes as expected, and the start and end positions of the tokens match my annotations. Some may suggest using the WhitespaceTokenizer instead, but it’s very important to me that the ‘?’ character gets its own token because of my custom word embedding.
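To make the alignment claim concrete, here is a minimal sketch of a token-boundary check. It is not Rasa's implementation: `is_aligned` is a hypothetical helper, and only the offsets for ‘10/02’ (7) and ‘?’ (33) come from the spaCy output above. The other offsets are worked out by hand, assuming one token per whitespace-separated word.

```python
def is_aligned(tokens, start, end):
    """Return True if the span [start, end) starts and ends on token boundaries.

    tokens: list of (text, start_offset) pairs, e.g. (tok.text, tok.idx) from spaCy.
    This is a hypothetical helper for illustration, not Rasa's own check.
    """
    token_starts = {idx for _, idx in tokens}
    token_ends = {idx + len(text) for text, idx in tokens}
    return start in token_starts and end in token_ends


sentence = "why my 10/02 transaction got this?"

# Offsets for "10/02" and "?" are the ones printed by spaCy above;
# the remaining ones are derived by hand (assumption).
tokens = [
    ("why", 0), ("my", 4), ("10/02", 7), ("transaction", 13),
    ("got", 25), ("this", 29), ("?", 33),
]

# Annotation covering exactly "10/02": characters 7..12 (end exclusive).
print(sentence[7:12], is_aligned(tokens, 7, 12))   # 10/02 True

# Annotation that also swallows the trailing space: not on a token boundary.
print(is_aligned(tokens, 7, 13))                   # False
```

With these offsets, an annotation of start 7 and end 12 lines up with the ‘10/02’ token, which is why the warning is surprising.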

I couldn’t identify the problem in the Rasa source code for spacy_tokenizer, so I haven’t been able to come up with a workaround. I wonder whether I can keep using the SpacyTokenizer without writing a custom component to deal with characters like ‘/’.

Any suggestions or help would be greatly appreciated!
