Adding new token patterns to Whitespace Tokenizer

Hi everyone, I would like to know how one adds new token patterns. I added some regex to the WhitespaceTokenizer for “-” and “/”, because I have entities like “off-road” where I would like the tokenizer to split the word, so that DIET can use “off” or “road” and still manage to match the entity “off-road”. People can say the term in many ways, for example “I like off road biking” or “I like off-road 4x4”.
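
For reference, here is roughly what I mean (just a sketch, assuming the `token_pattern` option of the tokenizer is what does the extra splitting; the exact regex is only an illustration):

```yaml
# config.yml (sketch): token_pattern matches become the tokens, so a
# pattern that excludes whitespace, "-" and "/" splits "off-road" into
# "off" + "road" and "vegetarian/vegan" into "vegetarian" + "vegan"
pipeline:
  - name: WhitespaceTokenizer
    token_pattern: '(?u)[^\s/-]+'
  # ... featurizers and DIETClassifier follow as usual
```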

How does one handle this? When I add the token pattern to the WhitespaceTokenizer, I get warnings like this:

```
Misaligned entity annotation in message 'vegetarian/vegan' with intent 'specify2019722'. Make sure the start and end values of entities ([(0, 16, '167569490')]) in the training data match the token boundaries ([(10, 11, '/')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
```
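
For context, the training example the warning refers to presumably looks something like this in the NLU data (a hypothetical reconstruction, reusing the ID from the warning as the entity name):

```yaml
# nlu.yml (hypothetical): the annotated span covers characters 0-16,
# which the warning compares against the token boundaries produced by
# the tokenizer after my custom splitting
nlu:
  - intent: specify2019722
    examples: |
      - [vegetarian/vegan]{"entity": "167569490"}
```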