Having trouble formatting training examples that contain a '-' or other punctuation marks

I keep getting an error for training examples that contain punctuation. For instance, please see the following training examples with my annotations and the errors thrown further below -

- Budget around [1200](budgetLowerLimit)-[2000](budgetUpperLimit) if that's even a possibility anymore ?
- I'm looking for somewhere around [3800$](budgetUpperLimit)
- Budget is [2100](budgetLowerLimit)-[2500](budgetUpperLimit)

Here are the errors I got -

/build/lib/python3.6/site-packages/rasa/nlu/extractors/crf_entity_extractor.py:515: UserWarning: Misaligned entity annotation for '1200-2000' in sentence 'Budget around 1200-2000 if that's even a possibility anymore ?' with intent 'state_preferences'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).
  f"Misaligned entity annotation for '{collected_text}' "

/build/lib/python3.6/site-packages/rasa/nlu/extractors/crf_entity_extractor.py:515: UserWarning: Misaligned entity annotation for 'around 3800' in sentence 'I'm looking for somewhere around 3800$' with intent 'state_preferences'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).
  f"Misaligned entity annotation for '{collected_text}' "

/build/lib/python3.6/site-packages/rasa/nlu/extractors/crf_entity_extractor.py:515: UserWarning: Misaligned entity annotation for '2100-2500' in sentence 'Budget is 2100-2500' with intent 'state_preferences'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).
  f"Misaligned entity annotation for '{collected_text}' "

In the first and third examples, I also tried escaping the ‘-’ character (as per Markdown syntax), but I keep getting the same error when training the model.

Can anyone help me avoid this, and tell me whether it will cause a serious problem in my training that I’m not foreseeing today? Thanks in advance!

There is a regex that strips special characters here: rasa/whitespace_tokenizer.py at 31f2357a661ca31bbce3e471dfd065a1c4542a8d · RasaHQ/rasa · GitHub

It does the wrong thing in your case: e.g. 1200-2000 is recognized as a single token, and $ is removed from the token.
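To make the mismatch concrete, here is a rough sketch of why the warning fires. The `tokenize` function below is a simplified stand-in written for this illustration (it only splits on whitespace and drops trailing punctuation); it is not the actual regex from whitespace_tokenizer.py, and `is_aligned` is a hypothetical helper, not a Rasa function. The point is only that an annotated span has to start and end exactly where a token starts and ends.

```python
import re


def tokenize(text):
    """Simplified stand-in tokenizer: split on whitespace and drop
    trailing punctuation. Returns (token, start, end) tuples."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        stripped = re.sub(r"[^\w\s]+$", "", match.group())
        if stripped:
            start = match.start()
            tokens.append((stripped, start, start + len(stripped)))
    return tokens


def is_aligned(tokens, start, end):
    """An entity span is aligned if it starts at some token's start
    and ends at some token's end."""
    starts = {s for _, s, _ in tokens}
    ends = {e for _, _, e in tokens}
    return start in starts and end in ends


text = "Budget is 2100-2500"
tokens = tokenize(text)  # "2100-2500" stays a single token

lower_start = text.index("2100")
upper_start = text.index("2500")

# Neither annotated value lines up with a token boundary.
lower_ok = is_aligned(tokens, lower_start, lower_start + 4)
upper_ok = is_aligned(tokens, upper_start, upper_start + 4)

text2 = "I'm looking for somewhere around 3800$"
tokens2 = tokenize(text2)  # the trailing "$" is dropped from the token
dollar_start = text2.index("3800$")
dollar_ok = is_aligned(tokens2, dollar_start, dollar_start + 5)
```

With this stand-in, all three checks come out as misaligned, which matches the kind of situation the warning describes: the annotation boundary falls inside a token, or includes a character the tokenizer threw away.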

Hello @Ghostvv, thanks for pointing out the source! I think I have the following options -

  1. Check if Duckling interprets the upper and the lower limits and the values correctly
  2. Create a custom component to ignore the warnings related to “-” and “$”

Are there other options that you think I should try? Are there any risks of interpreting two items on either side of a “-” as separate entities?

Sorry for the delay in my reply. I had to check the remaining examples in my training data to see if there are other characters I should watch out for.
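For anyone who needs to run the same kind of check, a scan along these lines may help. It is only a sketch: it assumes the Markdown-style `[value](entity)` annotation shown in the examples above, and the names `ENTITY` and `suspicious_characters` are made up for this illustration. It collects every non-whitespace character that directly touches an annotated entity, plus any punctuation at the edges of the annotated value itself.

```python
import re
from collections import Counter

# Matches Markdown-style annotations such as [1200](budgetLowerLimit).
ENTITY = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def suspicious_characters(example):
    """Return characters likely to break token alignment for one example."""
    found = []
    for match in ENTITY.finditer(example):
        value = match.group(1)
        # Characters immediately before and after the annotation.
        before = example[match.start() - 1] if match.start() > 0 else " "
        after = example[match.end()] if match.end() < len(example) else " "
        for ch in (before, after):
            if not ch.isspace():
                found.append(ch)
        # Punctuation at the edges of the annotated value itself.
        for ch in (value[0], value[-1]):
            if not ch.isalnum():
                found.append(ch)
    return found


examples = [
    "Budget around [1200](budgetLowerLimit)-[2000](budgetUpperLimit) "
    "if that's even a possibility anymore ?",
    "I'm looking for somewhere around [3800$](budgetUpperLimit)",
    "Budget is [2100](budgetLowerLimit)-[2500](budgetUpperLimit)",
]

counts = Counter(ch for ex in examples for ch in suspicious_characters(ex))
```

On the three examples from this thread, `counts` contains only `-` and `$`, so those are the two characters to watch in this sample.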

I don’t think there is a risk. We just came up with this regex because we thought it was more or less general. The main problem is how different people annotate their entities.

Understood. In my case, there will be two entities on either side of the “-” character. However, since these are typically amounts, Duckling does a good job of pulling these values.