The model is trained with the entity “GDS - Equity” (please note the space before and after the hyphen), but it is extracted as “GDS” and “Equity” instead of “GDS - Equity”. It doesn’t seem to be a problem with multi-word values or with hyphens in general, since I can extract the entities “AVIVA CAD1” and “GDS-Equity” (no space before/after the hyphen). The problem is clearly with the combination of hyphen and space.
Here are the versions that we are using for Rasa/RasaX/RasaSDK:
RasaX: 0.38.1
Rasa: 2.4.3-full
Rasa-sdk: 2.4.1
As mentioned in my post, the problem is not with multiple words. It’s the combination of hyphen and space that causes it. This was working fine in the previous version (RasaX: 0.32.2 / Rasa: 1.9.6-full / Rasa-sdk: 1.9.0). We recently upgraded to the latest version and are now seeing this issue.
(also posted in the issue above)
I was able to reproduce this, but I don’t believe it is a bug. WhitespaceTokenizer substitutes out non-word characters (including “-”) that have a space before and/or after them, see here.
At prediction time, DIETClassifier does not see the hyphen, and so cannot predict an entity that spans across the hyphen.
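To illustrate the effect, here is a minimal Python sketch. It is not Rasa's actual implementation: the regex is a simplified stand-in and `simplified_whitespace_tokenize` is a hypothetical name. It only assumes the behaviour described above, namely that non-word characters next to whitespace are dropped before the text is split.

```python
import re


def simplified_whitespace_tokenize(text: str) -> list:
    """Hypothetical, simplified stand-in for the cleaning step (not Rasa's regex)."""
    cleaned = re.sub(
        # non-word characters followed by whitespace or the end of the string
        r"[^\w]+(?=\s|$)"
        # or punctuation preceded by whitespace or the start of the string
        r"|(?<!\S)[^\w\s]+",
        " ",
        text,
    )
    return cleaned.split()


# The hyphen surrounded by spaces disappears, so no token contains it.
print(simplified_whitespace_tokenize("GDS - Equity"))  # ['GDS', 'Equity']
# A hyphen with no surrounding spaces stays inside a single token.
print(simplified_whitespace_tokenize("GDS-Equity"))    # ['GDS-Equity']
```

With input like the first call, the entity extractor only ever sees the tokens `GDS` and `Equity`, which matches the behaviour reported in the question.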
@hari there are a couple of different options here. You can:
Write your own tokenizer, possibly based on WhitespaceTokenizer, which does not perform this cleaning.
Write a custom component that comes before the tokenizer and removes whitespace around hyphens, meaning:
“apple - banana” would be converted to “apple-banana”
Modify your training data to include the entities without hyphens. The result is that DIETClassifier will be trained on the data in the same form that it will see during prediction. For example:
- [apple banana](my_entity)
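For the second option, the text normalization itself could look roughly like the sketch below. It only shows the string handling: `collapse_hyphen_whitespace` is a hypothetical helper name, and wiring it into the pipeline as a custom component is left out because that depends on your Rasa version. It also assumes every hyphen with surrounding whitespace should be collapsed, which may not hold for all of your messages.

```python
import re


def collapse_hyphen_whitespace(text: str) -> str:
    """Remove any whitespace directly before or after a hyphen (hypothetical helper)."""
    return re.sub(r"\s*-\s*", "-", text)


print(collapse_hyphen_whitespace("apple - banana"))  # apple-banana
print(collapse_hyphen_whitespace("GDS - Equity"))    # GDS-Equity
```

If you go this way, the training examples would presumably need the same normalization, so that the annotated entity values still match the text the model is trained on.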