After using SpacyTokenizer: Misaligned entity annotation warning when using CRFEntityExtractor

Hey all,

I have already annotated a bunch of entities, but I seem to have a problem with the custom entity extractor part.

Example:

2020-02-24 09:59:54 WARNING rasa.nlu.extractors.crf_entity_extractor - Misaligned entity annotation for ‘10/02’ in sentence ‘why my 10/02 transaction got this?’ with intent ‘complain_transaction’. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don’t include trailing whitespaces or punctuation).

Since I am training with a Brazilian Portuguese tokenizer, I specified this in config.yml:

language : "pt-br"
pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "CRFEntityExtractor"

Also, in a simple sanity check using the spaCy API with its pt-br tokenizer, I got the expected outputs:

import spacy
nlp = spacy.load('pt_core_news_sm')
test = nlp('why my 10/02 transaction got this?')
print(test[2].text, test[2].idx)

10/02 7

print(test[6].text, test[6].idx)

? 33

So the main point here is that the spaCy tokenizer, when using the PT model, tokenizes the sentence as expected, and the start and end positions of the tokens match my annotations. Some may suggest using the WhitespaceTokenizer instead, but it's very important to me that the ‘?’ character gets its own token because of my custom word embedding.
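To make "matching my annotations" concrete, here is a minimal sketch of the check I have in mind, based only on what the warning text says (start and end values must fall on token boundaries). The helper name `is_aligned` is hypothetical and not part of Rasa or spaCy. Only the offsets for ‘10/02’ (7) and ‘?’ (33) come from the output above; the other token offsets and the annotation span (start=7, end=12) are my assumptions, derived from the sentence itself.

```python
def is_aligned(tokens, start, end):
    """Return True if the span [start, end) starts and ends on token boundaries.

    tokens: list of (text, start_offset) pairs,
            e.g. [(t.text, t.idx) for t in doc] for a spaCy doc.
    """
    token_starts = {idx for _, idx in tokens}
    token_ends = {idx + len(text) for text, idx in tokens}
    return start in token_starts and end in token_ends


# Assumed tokenization of 'why my 10/02 transaction got this?'
# (only the offsets 7 and 33 were actually printed above).
tokens = [
    ("why", 0), ("my", 4), ("10/02", 7), ("transaction", 13),
    ("got", 25), ("this", 29), ("?", 33),
]

# Assumed annotation for '10/02': start=7, end=12.
print(is_aligned(tokens, 7, 12))  # True

# A span that includes the trailing space would not be aligned.
print(is_aligned(tokens, 7, 13))  # False
```

With the tokens spaCy gives me, this check passes for ‘10/02’, which is why I suspect the tokens seen by the CRFEntityExtractor differ from the ones in my sanity check.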

I couldn’t identify the problem in the Rasa source code for spacy_tokenizer, so I haven’t been able to come up with a workaround. I wonder if I can keep using the SpacyTokenizer without writing a custom component to deal with characters like ‘/’.

Any suggestions or help would be much appreciated!