Having trouble formatting training examples that contain a '-' or other punctuation marks

ganeshv · January 26, 2020, 6:10am

I keep getting a warning for training examples that contain punctuation. For instance, please see the following training examples with my annotations, and the warnings thrown further below:

- Budget around [1200](budgetLowerLimit)-[2000](budgetUpperLimit) if that's even a possibility anymore ?
- I'm looking for somewhere around [3800$](budgetUpperLimit)
- Budget is [2100](budgetLowerLimit)-[2500](budgetUpperLimit)

Here are the warnings I got:

/build/lib/python3.6/site-packages/rasa/nlu/extractors/crf_entity_extractor.py:515: UserWarning: Misaligned entity annotation for '1200-2000' in sentence 'Budget around 1200-2000 if that's even a possibility anymore ?' with intent 'state_preferences'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).
  f"Misaligned entity annotation for '{collected_text}' "

/build/lib/python3.6/site-packages/rasa/nlu/extractors/crf_entity_extractor.py:515: UserWarning: Misaligned entity annotation for 'around 3800' in sentence 'I'm looking for somewhere around 3800$' with intent 'state_preferences'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).
  f"Misaligned entity annotation for '{collected_text}' "

/build/lib/python3.6/site-packages/rasa/nlu/extractors/crf_entity_extractor.py:515: UserWarning: Misaligned entity annotation for '2100-2500' in sentence 'Budget is 2100-2500' with intent 'state_preferences'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).
  f"Misaligned entity annotation for '{collected_text}' "

In the first and third examples, I also tried escaping the ‘-’ character (as per markdown syntax), but I keep getting the same warning when training the model.

Can anyone help me avoid this, and tell me whether it will cause a serious problem in my training that I’m not foreseeing today? Thanks in advance!

Ghostvv · January 27, 2020, 10:11am

There is a regex that strips special characters here: rasa/whitespace_tokenizer.py at 31f2357a661ca31bbce3e471dfd065a1c4542a8d · RasaHQ/rasa · GitHub

It does the wrong thing in your case: e.g. 1200-2000 is recognized as a single token, and $ is removed from the token, so your annotated entities no longer start and end at token boundaries.
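
To make the mismatch concrete, here is a minimal sketch of the idea. It is not the actual Rasa regex: `tokenize` and `is_aligned` are hypothetical helpers, and the set of stripped characters is an assumption chosen only to reproduce the behaviour described above (split on whitespace, drop punctuation at the edges of a token, keep a '-' inside a token).

```python
import re

# Assumption: punctuation stripped from the edges of each token.
# This is an illustrative set, not the one used by Rasa.
EDGE_PUNCTUATION = "$?!.,;:"


def tokenize(text):
    """Split on whitespace and strip edge punctuation.

    Returns a list of (token, start, end) tuples with character offsets.
    """
    tokens = []
    for match in re.finditer(r"\S+", text):
        word = match.group()
        stripped = word.strip(EDGE_PUNCTUATION)
        if not stripped:
            continue  # the word was punctuation only, e.g. "?"
        # Number of characters removed from the left edge of the word.
        offset = len(word) - len(word.lstrip(EDGE_PUNCTUATION))
        start = match.start() + offset
        tokens.append((stripped, start, start + len(stripped)))
    return tokens


def is_aligned(tokens, entity_start, entity_end):
    """An entity is aligned if it starts and ends on token boundaries."""
    starts = {start for _, start, _ in tokens}
    ends = {end for _, _, end in tokens}
    return entity_start in starts and entity_end in ends


text = "Budget is 2100-2500"
tokens = tokenize(text)

# "2100" ends in the middle of the single token "2100-2500".
lower_start = text.index("2100")
print(is_aligned(tokens, lower_start, lower_start + len("2100")))

# Annotating the whole span "2100-2500" would line up with the token.
print(is_aligned(tokens, lower_start, lower_start + len("2100-2500")))
```

Under these assumptions, `[3800$]` is misaligned for the same reason: the token ends before the `$`, so annotating `[3800]$` instead would line up with the token boundary.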

ganeshv · January 28, 2020, 2:58pm

Hello @Ghostvv, thanks for pointing out the source! I think I have the following options:

- Check if Duckling interprets the upper and the lower limits and the values correctly
- Create a custom component to ignore the warnings related to “-” and “$”

Are there other options that you think I should try? Are there any risks of interpreting two items on either side of a “-” as separate entities?

Sorry for the delay in my reply. I had to check the remaining examples in my training data to see if there are other characters I should watch out for.

Ghostvv · January 30, 2020, 1:22pm

I don’t think there is a risk. We came up with this regex because we thought it was more or less general. The main problem is that different people annotate their entities differently.

ganeshv · January 30, 2020, 3:43pm

Understood. In my case, there will be two entities on either side of the “-” character. However, since these are typically amounts, Duckling does a good job of pulling these values.
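
For anyone without Duckling available, a rough fallback for the range case could look like the sketch below. This is plain regex post-processing, not Duckling's API; `parse_budget_range` is a hypothetical helper, and it assumes budgets are written as two whole numbers separated by a '-'.

```python
import re

# Assumption: a range is two runs of digits separated by '-',
# optionally with spaces around the '-'.
RANGE_PATTERN = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_budget_range(text):
    """Return (lower, upper) as integers, or None if no range is found."""
    match = RANGE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_budget_range("Budget is 2100-2500"))
print(parse_budget_range("I'm looking for somewhere around 3800$"))
```

The two values could then be mapped to budgetLowerLimit and budgetUpperLimit in application code. A single amount such as 3800$ is not covered by this pattern and would need separate handling.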
