Different entity recognition behavior between the pretrained_embeddings_spacy and supervised_embeddings pipelines


I’m building a multilingual chatbot, using a different model for each language. I’m using pretrained_embeddings_spacy for English, and supervised_embeddings for the other languages.

I have noticed that the two behave differently when recognizing entities:

  • The pretrained_embeddings_spacy can recognize both uppercase and lowercase entities even when you only define the lowercase entities in training data. Furthermore, it automatically converts uppercase entities in the user’s message to lowercase.

  • The supervised_embeddings, however, only recognizes uppercase or lowercase entities depending on how you define them in the training data. It won’t detect lowercase entities if there are none in the training data (and vice versa), so you have to define both. Furthermore, it keeps the original casing of the entities after recognizing them.

What is the cause of this difference? If I’m not mistaken, they both use the same CRFEntityExtractor. Is there a way to make supervised_embeddings behave the same as pretrained_embeddings_spacy when recognizing entities? It would be a little more convenient for me.

The pretrained embeddings pipeline uses SpacyNLP (as the name says), which is case-insensitive by default, so the text is lowercased before anything else happens. You can add

case_sensitive: true

under “SpacyNLP” in your pipeline config to make it case-sensitive, which gives you the same behaviour as supervised_embeddings.

See the Components page in the Rasa documentation.

Thank you @IgNoRaNt23. I want to make supervised_embeddings case-insensitive though, at least for now, because it would be convenient to recognize an entity whether it is uppercase or not (except for person names, I guess). Can I set case_sensitive to false for the supervised_embeddings pipeline?

I’m not sure, but you can try. If that doesn’t work, you could write your own custom component that lowercases the message. I’m also not sure whether you would need to lowercase all your training data if you go that route. But it’s probably not hard to do, so just try it.
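For what it’s worth, the custom-component idea could look roughly like the sketch below. It is only an illustration under some assumptions: `Message` and `LowercaseText` are made-up stand-ins so the snippet runs without Rasa installed. In a real pipeline you would subclass Rasa’s `Component` (whose `train` and `process` methods take extra arguments) and list your class before the tokenizer in the config. The sketch lowercases training examples as well, so that training and prediction see the same casing.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    """Hypothetical stand-in for an NLU message: raw text plus annotated entities."""
    text: str
    entities: List[Dict] = field(default_factory=list)


class LowercaseText:
    """Sketch of a component that lowercases text before tokenizing and featurizing.

    Assumption: this mirrors the train/process shape of a Rasa NLU component,
    but it is not wired into Rasa here.
    """

    def _lower(self, message: Message) -> None:
        message.text = message.text.lower()
        for entity in message.entities:
            # Keep annotated entity values consistent with the lowercased text.
            # Start/end offsets are left alone; lower() keeps the length for
            # ASCII text, but a few Unicode characters can change length.
            entity["value"] = entity["value"].lower()

    def train(self, training_examples: List[Message]) -> None:
        # Lowercase every training example in place.
        for example in training_examples:
            self._lower(example)

    def process(self, message: Message) -> None:
        # Lowercase an incoming user message in place.
        self._lower(message)


component = LowercaseText()

training_examples = [
    Message(
        "Book a flight to Paris",
        [{"start": 17, "end": 22, "value": "Paris", "entity": "city"}],
    )
]
component.train(training_examples)

incoming = Message("Fly me to PARIS")
component.process(incoming)
print(incoming.text)  # fly me to paris
```

One side effect to keep in mind: extracted entity values come back lowercased, like what you see with the spaCy pipeline, so this is probably not what you want for person names.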