Rasa is not extracting Entity value with hyphen and space

Rasa is not extracting Entity value if it contains a hyphen and space.

NLU Data:-

  • intent: getcurrencyweight examples: |

Ex:- [{‘entity’: ‘account’, ‘start’: 29, ‘end’: 32, ‘confidence_entity’: 0.9999918937683105, ‘value’: ‘GDS’, ‘extractor’: ‘DIETClassifier’}, {‘entity’: ‘account’, ‘start’: 35, ‘end’: 41, ‘confidence_entity’: 0.9999841451644897, ‘value’: ‘Equity’, ‘extractor’: ‘DIETClassifier’}],

The model is trained with the “GDS - Equity” entity (please note the space before and after the hyphen), but it is extracted as “GDS” and “Equity” instead of “GDS - Equity”. It doesn’t seem to be a problem with multiple words or with a hyphen in the entity value, as I can extract the entities “AVIVA CAD1” and “GDS-Equity” (no space before/after the hyphen). The problem is clearly with the combination of hyphen and space.

Here are the versions that we are using for Rasa, Rasa X, and Rasa SDK:

Rasa X: 0.38.1
Rasa: 2.4.3-full
Rasa SDK: 2.4.1

Config.yml:-

language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: (?u)\b\w+\b
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200
    ranking_length: 5
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
policies:
  - name: TEDPolicy
    max_history: 10
    epochs: 20
    batch_size:
      - 32
      - 64
  - name: AugmentedMemoizationPolicy
    max_history: 6
  - name: RulePolicy
    core_fallback_threshold: 0.3
    core_fallback_action_name: "action_default_fallback"
    enable_fallback_prediction: True

Would appreciate any suggestions. Thanks

What happens if you remove the hyphen: What is the weight of AAL in [GDS Equity](account)?

Hello @Ghostvv,

If I remove the hyphen, it extracts the entity “GDS Equity”.

{‘entity’: ‘account’, ‘start’: 29, ‘end’: 39, ‘confidence_entity’: 0.9999830722808838, ‘value’: ‘GDS Equity’, ‘extractor’: ‘DIETClassifier’}

As mentioned in my post, the problem is not with multiple words. It’s the combination of hyphen and space that is causing the problem. This was working fine in the previous version (RasaX:- 0.32.2 / Rasa:1.9.6-full / Rasa-sdk:1.9.0). We recently upgraded to the latest version and are now seeing this issue.

Do you see it in Rasa X? Does it work in Rasa Open Source from the command line?

I can see it in Rasa X and in the primary Rasa server logs.

{‘entity’: ‘account’, ‘start’: 29, ‘end’: 32, ‘confidence_entity’: 0.9999918937683105, ‘value’: ‘GDS’, ‘extractor’: ‘DIETClassifier’}, {‘entity’: ‘account’, ‘start’: 35, ‘end’: 41, ‘confidence_entity’: 0.9999841451644897, ‘value’: ‘Equity’, ‘extractor’: ‘DIETClassifier’}

I noticed the same behavior when I tried it from the Rasa Open Source command line.

Here is the log:

2021-04-14 08:18:54 DEBUG rasa.core.processor - Received user message ‘What is the weight of AAL in GDS - Equity’ with intent ‘{‘id’: -5174051988714285729, ‘name’: ‘getsecurityweight’, ‘confidence’: 0.9999801516532898}’ and entities ‘[{‘entity’: ‘security’, ‘start’: 22, ‘end’: 25, ‘confidence_entity’: 0.9999903440475464, ‘value’: ‘AAL’, ‘extractor’: ‘DIETClassifier’}, {‘entity’: ‘account’, ‘start’: 29, ‘end’: 32, ‘confidence_entity’: 0.9999918937683105, ‘value’: ‘GDS’, ‘extractor’: ‘DIETClassifier’}, {‘entity’: ‘account’, ‘start’: 35, ‘end’: 41, ‘confidence_entity’: 0.9999841451644897, ‘value’: ‘Equity’, ‘extractor’: ‘DIETClassifier’}]’

I think there is a bug in tokenization and/or in merging extracted entities that span several tokens. Could you please create a bug report as a GitHub issue?

Sure will create a bug. Thanks.

@Ghostvv Created bug and here is the link:-


(also posted in the issue above) I was able to reproduce this, but I don’t believe it is a bug. We sub out non-word characters (including “-”) in WhitespaceTokenizer if there is a space before it and/or after it, see here.

This can be verified like so:

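# Import paths below are assumed for Rasa 2.x; adjust them for other versions.
import pytest

from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message


@pytest.mark.parametrize(
    "text, expected_tokens",
    [
        # a hyphen with a space on either side is dropped and splits the tokens
        ("apple - banana", ["apple", "banana"]),
        # a hyphen with no surrounding space stays inside a single token
        ("apple-banana", ["apple-banana"]),
        ("apple- banana", ["apple", "banana"]),
        ("apple -banana", ["apple", "banana"]),
    ],
)
def test_whitespace_tokenizer_hyphens(text, expected_tokens):
    tk = WhitespaceTokenizer({})
    message = Message.build(text=text)
    tokens = tk.tokenize(message, TEXT)
    assert [t.text for t in tokens] == expected_tokens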

At prediction time, DIETClassifier does not see the hyphen, and so cannot predict an entity that spans across the hyphen.
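To make the effect on entity offsets concrete, here is a minimal standalone sketch. It is not Rasa's actual tokenizer code: `simplified_tokens` is a hypothetical stand-in that only imitates the behavior described above (split on whitespace, then drop punctuation at the edges of each chunk), so it should be read as an illustration under that assumption.

```python
import re


def simplified_tokens(text):
    """Hypothetical stand-in for the tokenizer behavior described above.

    Splits on whitespace, trims non-word characters from the edges of each
    chunk, and returns (token, start, end) tuples with character offsets.
    """
    tokens = []
    for chunk in re.finditer(r"\S+", text):
        # Keep the span from the first to the last word character in the chunk.
        core = re.search(r"\w(?:.*\w)?", chunk.group())
        if core is None:
            # The chunk is pure punctuation (e.g. a lone "-"), so it is dropped.
            continue
        start = chunk.start() + core.start()
        end = chunk.start() + core.end()
        tokens.append((core.group(), start, end))
    return tokens


message = "What is the weight of AAL in GDS - Equity"
for token in simplified_tokens(message)[-2:]:
    print(token)
```

Under this sketch the last two tokens are "GDS" at 29 to 32 and "Equity" at 35 to 41, the same spans reported in the log above. The hyphen at position 33 never becomes a token, so there is nothing between the two spans for a single entity to cover.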

@hari there are a couple of different options here. You can:

  1. write your own tokenizer, possibly based off of WhitespaceTokenizer, which does not perform this cleaning
  2. write a custom component that comes before the tokenizer and removes whitespace around hyphens, meaning: “apple - banana” would be converted to “apple-banana”
  3. modify your training data to include the entities without hyphens, for example: [apple banana](my_entity). The result is that DIETClassifier will be trained on the data in the same form that it will see it during prediction.
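As a rough illustration of option 2, the text normalization itself could look like the sketch below. The function name `collapse_hyphen_spacing` is hypothetical, and wiring it into a custom Rasa component that runs before the tokenizer is left out because that API differs between Rasa versions.

```python
import re

# Matches a hyphen together with any whitespace directly around it.
_HYPHEN_WITH_SPACING = re.compile(r"\s*-\s*")


def collapse_hyphen_spacing(text: str) -> str:
    """Remove whitespace around hyphens so "apple - banana" becomes "apple-banana"."""
    return _HYPHEN_WITH_SPACING.sub("-", text)


print(collapse_hyphen_spacing("What is the weight of AAL in GDS - Equity"))
```

This is deliberately naive: it also rewrites hyphens used as dashes or minus signs (for example "5 - 3" becomes "5-3"), and it shortens the message, so entity start and end offsets then refer to the normalized text rather than to what the user typed. The training examples would need the same normalization so that the model sees "GDS-Equity" in both training and prediction.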

Please let me know if you have any questions!