Hey! I was following this tutorial, https://medium.com/@tatiana.parshina/building-rasa-nlu-custom-component-for-lemmatization-with-spacy-e2e6a9562c34, to implement lemmatization as a preprocessing step in my NLU pipeline.
I ran into an issue with the entity start and end positions. When a word gets lemmatized, it usually becomes shorter (e.g. looking -> look). So if “looking” is the first word in the sentence, it has the start value 1 and the end value 7 assigned, but after lemmatization it should have the end value 4. The positions of all following words shift as well. How can I make sure that the word positions get reassigned after lemmatization?
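To make the shift concrete, here is a minimal sketch in plain Python. It does not use Rasa or spaCy: `LEMMAS` is a hypothetical lookup table standing in for the lemmatizer, `spans` is a made-up helper, and the offsets are 0-based with an exclusive end, which is an assumption about the convention and differs from the 1-based counting used above.

```python
# Hypothetical lemma table standing in for spaCy's lemmatizer.
LEMMAS = {"looking": "look", "flights": "flight"}


def spans(words):
    """Return (word, start, end) for words joined by single spaces.

    Offsets are 0-based and the end is exclusive (an assumed convention).
    """
    result, offset = [], 0
    for word in words:
        result.append((word, offset, offset + len(word)))
        offset += len(word) + 1  # skip the separating space
    return result


text = "looking for flights"
words = text.split()

# Positions measured on the original message.
original = spans(words)
# Positions measured on the lemmatized message.
lemmatized = spans([LEMMAS.get(w, w) for w in words])

print(original)    # [('looking', 0, 7), ('for', 8, 11), ('flights', 12, 19)]
print(lemmatized)  # [('look', 0, 4), ('for', 5, 8), ('flight', 9, 15)]
```

An entity annotated on the original text as characters 12 to 19 no longer matches any word once positions are measured on the lemmatized text.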
My lemmatization component is exactly the same as in the tutorial, and my NLU pipeline is as follows:

language: "de"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer_Lemmatizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
The model can be trained, but when training with lemmatization I get the warning “Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don’t include trailing whitespaces or punctuation).” Does anyone have an idea what causes this and how to fix it?
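One possible explanation, stated as an assumption rather than confirmed behavior of Rasa or the tutorial's component: if a token's end is derived from the length of its text, then replacing the text with the lemma moves the end, and entity annotations made on the original message stop lining up. The sketch below illustrates that idea with a hypothetical `Token` class and a hypothetical `is_aligned` check. Neither is Rasa's actual API.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token, not Rasa's Token class."""
    text: str    # string passed to later components (surface form or lemma)
    offset: int  # 0-based start position in the original message
    length: int  # number of characters the token is assumed to cover

    @property
    def end(self) -> int:
        # Exclusive end position.
        return self.offset + self.length


def is_aligned(entity_start, entity_end, tokens):
    """True if the entity starts and ends exactly at token boundaries."""
    starts = {t.offset for t in tokens}
    ends = {t.end for t in tokens}
    return entity_start in starts and entity_end in ends


message = "book flights"
# Entity annotated on the original message: "flights" covers 5..12.
entity = (message.index("flights"), len(message))

# End derived from the lemma's length ("flight" has 6 characters).
by_lemma = [Token("book", 0, 4), Token("flight", 5, len("flight"))]
# End derived from the original surface form ("flights" has 7 characters).
by_surface = [Token("book", 0, 4), Token("flight", 5, len("flights"))]

print(is_aligned(*entity, by_lemma))    # False
print(is_aligned(*entity, by_surface))  # True
```

If this assumption holds, keeping the original offsets and length on each token while exposing the lemma as its text would keep the annotations aligned. Whether the tokenizer from the tutorial allows that is something I have not verified.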