Multiple Entity Detection Problem

Hello, I have been trying to detect multiple entities in one sentence with DIETClassifier, yet in some particular cases it merges two separate entities into one. For example:

My training data includes sufficient examples for an entity named “language”, and for the sentence below it detects the entities as “english, german” and “spanish” (2 entities, one of which is merged). The expected result is “english”, “german”, “spanish” (3 entities).

“i can speak english, german and spanish”

If I don’t put a comma between english and german, it detects them separately.

Any suggestions? Thanks

I’m having the exact same problem. Did you find an answer to this?

In my pipeline I use the WhitespaceTokenizer, but the sentence “I can speak english, german and spanish” is labeled as: I can speak [english, german](lang) and [spanish](lang). It seems the tokenizer ignores the space after the comma in this sentence. Like you described, removing the comma from the sentence fixes the issue.

According to the documentation and the code, the comma should be removed from the sentence before tokenizing, so I’d expect the two entities to be picked up individually.

Edit: a single entity with a comma after it is picked up just fine, e.g.: I can speak [english](lang), how about you?

The entity extractor simply combines adjacent words when they have the same entity label, even when they are defined and trained separately.

Imagine we have the sentence: I can speak english, german and spanish. This will be tagged as: [‘O’, ‘O’, ‘O’, ‘lang’, ‘lang’, ‘O’, ‘lang’]. The extractor loops over these tags and merges the two consecutive ‘lang’ tags (see here).
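To make the merging step concrete, here is a minimal sketch of that kind of loop. It is not the actual Rasa code: the function name `merge_consecutive_tags` and the regex-based tokenizer are my own assumptions for illustration. It only shows how two adjacent tokens with the same tag end up as one entity whose value spans the comma.

```python
import re
from typing import Any, Dict, List


def merge_consecutive_tags(text: str, tags: List[str]) -> List[Dict[str, Any]]:
    """Merge adjacent tokens that carry the same non-'O' tag into one entity."""
    # Assumption: a simple word tokenizer that drops punctuation such as the comma.
    tokens = list(re.finditer(r"\w+", text))
    entities: List[Dict[str, Any]] = []
    previous_tag = "O"
    for token, tag in zip(tokens, tags):
        if tag != "O":
            if tag == previous_tag:
                # Same tag as the previous token: extend the current entity.
                entities[-1]["end"] = token.end()
            else:
                entities.append(
                    {"entity": tag, "start": token.start(), "end": token.end()}
                )
        previous_tag = tag
    # The value is cut from the original text, so it includes the comma.
    for entity in entities:
        entity["value"] = text[entity["start"]:entity["end"]]
    return entities


text = "I can speak english, german and spanish"
tags = ["O", "O", "O", "lang", "lang", "O", "lang"]
print([e["value"] for e in merge_consecutive_tags(text, tags)])
# → ['english, german', 'spanish']
```

Because the tags carry no marker for where one entity ends and the next begins, the loop has no way to tell “english” and “german” apart once they are adjacent.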

So this looks like intended behaviour. Splitting the entities can be done in a custom action (which is the route I ended up taking).
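For anyone taking the same route, the splitting could look roughly like the sketch below. This is not my exact code: `split_merged_entities` and the `SEPARATORS` pattern are hypothetical names, and I am assuming each entity is a dict with `entity` and `value` keys where `value` is a string. In a custom action you would feed it the entities of the latest user message.

```python
import re
from typing import Any, Dict, List

# Assumption: merged values are separated by commas or the word "and".
SEPARATORS = re.compile(r"\s*,\s*|\s+and\s+")


def split_merged_entities(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Split entity values such as 'english, german' into separate entities."""
    result: List[Dict[str, Any]] = []
    for entity in entities:
        for part in SEPARATORS.split(entity["value"]):
            part = part.strip()
            if part:
                # Character offsets are dropped because they no longer apply.
                result.append({"entity": entity["entity"], "value": part})
    return result


extracted = [
    {"entity": "lang", "value": "english, german"},
    {"entity": "lang", "value": "spanish"},
]
print([e["value"] for e in split_merged_entities(extracted)])
# → ['english', 'german', 'spanish']
```

Splitting on separators only helps when something like a comma survives in the merged value. It does not help when two entities are merged with just a space between them.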

While searching for an answer, I saw that this behaviour was implemented to fix a bug caused by tokenizers like ConveRTTokenizer splitting words into subwords. So this behaviour exists to merge those subwords back together.

But this implementation also merges well-defined entities, which can be one or more words.

For example, with two entities that are city names, one “New York” and the other “Berlin”, this behaviour gives me [New York Berlin](city), which is a big problem for well-defined entities.

I think the code should at least filter defined entities, which I am trying to implement, but I am not familiar with the code base, so progress is slow.
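As a rough idea of what I mean by filtering on defined entities, here is a small sketch that works outside the Rasa code base. The function `split_by_known_values` and the set of known values are made up for illustration. It greedily matches the longest known value first and leaves the merged value untouched if any part is unknown.

```python
from typing import List, Set


def split_by_known_values(value: str, known_values: Set[str]) -> List[str]:
    """Split a merged entity value into known values, longest match first."""
    known = {v.lower() for v in known_values}
    words = value.split()
    parts: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at word i, then shorter ones.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate.lower() in known:
                parts.append(candidate)
                i = j
                break
        else:
            # Some part is not a known value: keep the merged value as it is.
            return [value]
    return parts


cities = {"New York", "Berlin"}  # hypothetical list of defined city values
print(split_by_known_values("New York Berlin", cities))
# → ['New York', 'Berlin']
```

This only works for entities with a closed set of values, such as cities or languages taken from the training data or a lookup table.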