To build a custom NER model, I am using the following pipeline:

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "RegexFeaturizer"
- name: "SpacyFeaturizer"
- name: "CRFEntityExtractor"

I have a list of entities (~2000 examples for 2 entity types). I find the start and end index of each entity in my dataset using string matching and pass them in the JSON format described in the Training Data Format documentation.

However, while training I get a missing entity annotation error. The error description says “Make sure start and end values of the annotating training examples end at token boundaries”.

How can I ensure that? String matching already gives me correct start and end indices. If the problem is caused by tokenization, how can I overcome it?
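
To make the question concrete, here is a minimal sketch of how I understand the check the error describes. It is not my actual code and not Rasa's implementation. The regex tokenizer is only a stand-in (an assumption) for SpacyTokenizer, which may split text differently, and `token_boundaries` and `find_aligned_spans` are hypothetical helper names. It shows how plain string matching can return indices that are correct for the substring but still fall inside a token.

```python
import re

# Stand-in tokenizer (assumption): runs of word characters, or single
# punctuation marks. The real SpacyTokenizer may produce different tokens.
# With spaCy, the equivalent offsets would come from token.idx and
# token.idx + len(token) for each token in nlp.make_doc(text).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def token_boundaries(text):
    """Return the sets of character offsets where tokens start and end."""
    starts, ends = set(), set()
    for match in TOKEN_RE.finditer(text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def find_aligned_spans(text, value):
    """Find every occurrence of `value` and flag whether it is token-aligned."""
    starts, ends = token_boundaries(text)
    spans = []
    for match in re.finditer(re.escape(value), text):
        aligned = match.start() in starts and match.end() in ends
        spans.append((match.start(), match.end(), aligned))
    return spans


text = "I met Ann at the annual Annecy fair."
for start, end, aligned in find_aligned_spans(text, "Ann"):
    status = "ok" if aligned else "misaligned"
    print(start, end, repr(text[start:end]), status)
```

In this example the second match of "Ann" is a correct substring match, but it ends in the middle of the token "Annecy", which I suspect is the kind of span that triggers the error. Matching with word boundaries (for example `\b` in a regex) or dropping the misaligned spans before training might avoid it, but I am not sure that is the right approach.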