Multiple Entity Detection Problem

denizvoid · April 17, 2020, 3:59pm

Hello, I have been trying to detect multiple entities in one sentence with DIETClassifier, but in some cases it merges two separate entities into one. For example:

My training data includes sufficient examples for an entity named “language”. For the sentence below, it detects the entities “english, german” and “spanish” (2 entities, one of them merged). The expected result is “english”, “german”, “spanish” (3 entities).

“i can speak english, german and spanish”

If I don’t put a comma between english and german, it detects them separately.

Any suggestions? Thanks

flythe · May 25, 2020, 10:27am

I’m having the exact same problem. Did you find an answer to this?

In my pipeline I use the WhitespaceTokenizer, but the sentence “I can speak english, german and spanish” is labeled as: I can speak [english, german](lang) and [spanish](lang). It seems the tokenizer ignores the space after the comma in this sentence. Like you described, removing the comma from the sentence fixes the issue.

According to the documentation and the code, the comma should be removed from the sentence before tokenizing, so I’d expect the two entities to be picked up individually.

Edit: a single entity with a comma after it is picked up just fine, e.g.: I can speak [english](lang), how about you?

flythe · May 25, 2020, 3:07pm

The entity extractor simply combines words when they have the same entity, even when they are defined and trained separately.

Imagine we have the sentence: I can speak english, german and spanish. This will be tagged as: [‘O’, ‘O’, ‘O’, ‘lang’, ‘lang’, ‘O’, ‘lang’]. The extractor loops over these tags and merges the two consecutive ‘lang’ tags (see here).
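To make the merging step concrete, here is a minimal sketch of that kind of loop. It is not Rasa's actual code: the function name `merge_consecutive_tags` and the dictionary layout are assumptions for illustration only. It joins tokens with a space, whereas the real extractor reports the original text span, which is why the comma shows up in “english, german”.

```python
from typing import Dict, List


def merge_consecutive_tags(tokens: List[str], tags: List[str]) -> List[Dict[str, str]]:
    """Merge adjacent tokens that carry the same non-'O' tag into one entity.

    Hypothetical helper that only illustrates the behaviour described above.
    """
    entities: List[Dict[str, str]] = []
    previous_tag = "O"
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # A token without an entity tag ends any running entity.
            previous_tag = "O"
            continue
        if tag == previous_tag:
            # Same tag as the previous token: extend the last entity.
            entities[-1]["value"] += " " + token
        else:
            # A different tag starts a new entity.
            entities.append({"entity": tag, "value": token})
        previous_tag = tag
    return entities


# The comma has already been dropped by the tokenizer in this example.
tokens = ["I", "can", "speak", "english", "german", "and", "spanish"]
tags = ["O", "O", "O", "lang", "lang", "O", "lang"]
print(merge_consecutive_tags(tokens, tags))
```

Because “english” and “german” are adjacent tokens with the same tag, they end up in one entity, while the untagged “and” separates “spanish” from them.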

So this looks like intended behaviour. Splitting the entities can be done in a custom action (which is the route I ended up taking).
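The post does not show the custom action itself, so the following is only a rough sketch of how such a split could look, under stated assumptions. The function name `split_merged_entities` and the comma separator are hypothetical. It assumes each entity is a dictionary with "entity" and "value" keys, like the entries a custom action can read from `tracker.latest_message["entities"]`.

```python
import re
from typing import Dict, List


def split_merged_entities(
    entities: List[Dict[str, str]], separator_pattern: str = r"\s*,\s*"
) -> List[Dict[str, str]]:
    """Split entity values that contain a separator into separate entities.

    Hypothetical post-processing step. Character offsets are not carried
    over, because they would no longer match the split values.
    """
    result: List[Dict[str, str]] = []
    for entity in entities:
        parts = re.split(separator_pattern, str(entity["value"]))
        for part in parts:
            if part:  # skip empty pieces caused by leading or trailing commas
                result.append({"entity": entity["entity"], "value": part})
    return result


extracted = [
    {"entity": "lang", "value": "english, german"},
    {"entity": "lang", "value": "spanish"},
]
print(split_merged_entities(extracted))
```

This only helps when the merged value still contains a separator such as a comma. Values merged without any separator need a different approach.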

denizvoid · May 25, 2020, 6:38pm

While searching for an answer, I found that this behaviour was implemented to fix a bug caused by tokenizers such as ConveRTTokenizer splitting words into subwords. The merging is there to join those subwords back together.

But this implementation also merges well-defined entities, which can be one or more words.

For example, with two city entities, “New York” and “Berlin”, this behaviour gives me [New York Berlin](city), which is a big problem for well-defined entities.

I think the code should at least filter defined entities. I am trying to implement this, but I am not familiar with the code base, so progress is slow.
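As a stopgap outside the code base, one possible workaround is to split a merged value against a list of known entity values after extraction. The sketch below is an assumption, not part of Rasa: `split_by_known_values` and the list of known values are hypothetical, and it uses a simple greedy longest-match over words.

```python
from typing import Iterable, List


def split_by_known_values(merged_value: str, known_values: Iterable[str]) -> List[str]:
    """Split a merged entity value into known values by greedy longest match.

    Hypothetical helper. Words that match no known value are kept as
    single-word pieces.
    """
    known = {" ".join(v.lower().split()) for v in known_values}
    max_words = max((len(v.split()) for v in known), default=1)
    words = merged_value.split()
    pieces: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if candidate.lower() in known:
                pieces.append(candidate)
                i += size
                break
        else:
            # No known value starts here; keep the word on its own.
            pieces.append(words[i])
            i += 1
    return pieces


print(split_by_known_values("New York Berlin", ["New York", "Berlin"]))
```

This only works when the entity values are known in advance, for example from a lookup table, so it would not cover open-ended entities.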
