Rasa is not extracting Entity value with hyphen and space

hari · April 13, 2021, 5:36pm

Rasa is not extracting Entity value if it contains a hyphen and space.

NLU data:

intent: getcurrencyweight
examples: |
  - What is the weight of AAL in GDS - Equity

Ex: [{'entity': 'account', 'start': 29, 'end': 32, 'confidence_entity': 0.9999918937683105, 'value': 'GDS', 'extractor': 'DIETClassifier'}, {'entity': 'account', 'start': 35, 'end': 41, 'confidence_entity': 0.9999841451644897, 'value': 'Equity', 'extractor': 'DIETClassifier'}]

The model is trained with the entity "GDS - Equity" (note the spaces before and after the hyphen), but it is extracted as two entities, "GDS" and "Equity", instead of "GDS - Equity". Multiple words or a hyphen on its own are not the problem: I can extract "AVIVA CAD1" and "GDS-Equity" (no spaces around the hyphen) correctly. The problem is specifically the combination of hyphen and space.

Here are the versions we are using:
RasaX: 0.38.1
Rasa: 2.4.3-full
Rasa-sdk: 2.4.1

Config.yml:

language: en

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  OOV_token: oov
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 200
  ranking_length: 5
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
- name: FallbackClassifier
  threshold: 0.3

policies:
- name: TEDPolicy
  max_history: 10
  epochs: 20
  batch_size:
  - 32
  - 64
- max_history: 6
  name: AugmentedMemoizationPolicy
- name: RulePolicy
  core_fallback_threshold: 0.3
  core_fallback_action_name: "action_default_fallback"
  enable_fallback_prediction: True

Would appreciate any suggestions. Thanks

Ghostvv · April 14, 2021, 9:31am

What happens if you remove the hyphen: What is the weight of AAL in [GDS Equity](account)?

hari · April 14, 2021, 12:06pm

Hello @Ghostvv,

If I remove the hyphen, it extracts the entity "GDS Equity".

{'entity': 'account', 'start': 29, 'end': 39, 'confidence_entity': 0.9999830722808838, 'value': 'GDS Equity', 'extractor': 'DIETClassifier'}

As mentioned in my post, the problem is not with multiple words; it is the combination of hyphen and space. This worked fine in the previous version (RasaX: 0.32.2 / Rasa: 1.9.6-full / Rasa-sdk: 1.9.0). We recently upgraded to the latest version and are now seeing this issue.

Ghostvv · April 14, 2021, 12:08pm

Do you see it in Rasa X? Does it work in Rasa Open Source from the command line?

hari · April 14, 2021, 12:15pm

I can see it in Rasa X and in the Rasa primary server logs.

{'entity': 'account', 'start': 29, 'end': 32, 'confidence_entity': 0.9999918937683105, 'value': 'GDS', 'extractor': 'DIETClassifier'}, {'entity': 'account', 'start': 35, 'end': 41, 'confidence_entity': 0.9999841451644897, 'value': 'Equity', 'extractor': 'DIETClassifier'}

hari · April 14, 2021, 12:21pm

I noticed the same behavior when I tried it from the Rasa Open Source command line.

Here is the log:

2021-04-14 08:18:54 DEBUG rasa.core.processor - Received user message 'What is the weight of AAL in GDS - Equity' with intent '{'id': -5174051988714285729, 'name': 'getsecurityweight', 'confidence': 0.9999801516532898}' and entities '[{'entity': 'security', 'start': 22, 'end': 25, 'confidence_entity': 0.9999903440475464, 'value': 'AAL', 'extractor': 'DIETClassifier'}, {'entity': 'account', 'start': 29, 'end': 32, 'confidence_entity': 0.9999918937683105, 'value': 'GDS', 'extractor': 'DIETClassifier'}, {'entity': 'account', 'start': 35, 'end': 41, 'confidence_entity': 0.9999841451644897, 'value': 'Equity', 'extractor': 'DIETClassifier'}]'

Ghostvv · April 14, 2021, 12:23pm

I think there is a bug in tokenization and/or in merging extracted entities that span several tokens. Could you please create a bug report as a GitHub issue?

hari · April 14, 2021, 12:25pm

Sure, I will create a bug report. Thanks.

hari · April 14, 2021, 12:48pm

@Ghostvv I created the bug report; here is the link:

fkoerner · June 18, 2021, 11:57am

(also posted in the issue above) I was able to reproduce this, but I don't believe it is a bug. WhitespaceTokenizer substitutes out non-word characters (including "-") when there is a space before and/or after them, see here.

This can be verified like so:

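# Import paths as in Rasa 2.x; they may differ in other versions.
import pytest
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message


@pytest.mark.parametrize(
    "text, expected_tokens",
    [
        # A hyphen next to whitespace is dropped, splitting the words.
        ("apple - banana", ["apple", "banana"]),
        # A hyphen with no surrounding whitespace stays inside the token.
        ("apple-banana", ["apple-banana"]),
        ("apple- banana", ["apple", "banana"]),
        ("apple -banana", ["apple", "banana"]),
    ],
)
def test_whitespace_tokenizer_hyphens(text, expected_tokens):
    tk = WhitespaceTokenizer({})
    message = Message.build(text=text)
    tokens = tk.tokenize(message, TEXT)
    assert [t.text for t in tokens] == expected_tokens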

At prediction time, DIETClassifier does not see the hyphen, and so cannot predict an entity that spans across the hyphen.
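To make the effect on the original message concrete, here is a minimal sketch in plain Python. It is not Rasa's tokenizer code: `sketch_tokenize` and `token_spans` are hypothetical helpers, and the regex is a simplified assumption that only mimics the behavior described above (punctuation touching whitespace or a string boundary is dropped).

```python
import re


def sketch_tokenize(text):
    """Simplified stand-in for the cleaning step, not Rasa's actual code."""
    # Replace runs of punctuation that touch whitespace or a string boundary.
    cleaned = re.sub(r"(?<!\S)[^\w\s]+|[^\w\s]+(?!\S)", " ", text)
    return cleaned.split()


def token_spans(text):
    """Return (token, start, end) triples located in the original text."""
    spans, pos = [], 0
    for token in sketch_tokenize(text):
        start = text.index(token, pos)
        pos = start + len(token)
        spans.append((token, start, pos))
    return spans


message = "What is the weight of AAL in GDS - Equity"
for token, start, end in token_spans(message):
    print(token, start, end)
```

Under these assumptions, "GDS" covers characters 29 to 32 and "Equity" covers 35 to 41, the same offsets as in the log above. The hyphen at position 33 belongs to no token, so there is nothing the classifier could label to bridge the two words.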

@hari there are a couple of different options here. You can:

- write your own tokenizer, possibly based off of WhitespaceTokenizer, which does not perform this cleaning
- write a custom component that comes before the tokenizer and removes whitespace around hyphens, meaning: "apple - banana" would be converted to "apple-banana"
- modify your training data to include the entities without hyphens, so that DIETClassifier is trained on the data in the same form that it will see during prediction, for example: [apple banana](my_entity)
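The text handling behind the first two options can be sketched in plain Python. Both function names are hypothetical, and wiring them into a Rasa pipeline as a custom tokenizer or custom component is not shown here.

```python
import re


def whitespace_only_tokens(text):
    # First option: split on whitespace only, so a lone hyphen stays a token.
    return text.split()


def join_spaced_hyphens(text):
    # Second option: remove whitespace around a hyphen between word characters.
    return re.sub(r"(?<=\w)\s*-\s*(?=\w)", "-", text)


print(whitespace_only_tokens("GDS - Equity"))
print(join_spaced_hyphens("What is the weight of AAL in GDS - Equity"))
```

Note that the second approach shortens the text, so entity start and end offsets would refer to the rewritten message, not to what the user typed. It would also join unrelated cases such as "5 - 3", so the pattern may need narrowing for real data.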

Please let me know if you have any questions!
