Is there a way to extract entities for phrases that contain commas, other than regex entity extraction? I have a lookup table with a list of values (and plenty of training examples for each of them). A few of those values are phrases with a comma in the middle, and I need those phrases preserved as entered when the entity is extracted, because the extracted value is passed on to query a database.
For example, if I have a few company names in my lookup table:
- Cool Tech
- Corporation Inc.
- Business Name, LLC
- Business Name, LLC (subdivision ab)
I want to be able to extract “Business Name, LLC” as a single entity rather than extracting “Business Name” and “LLC” as two separate entities with the same entity label.
Have you seen the use_word_boundaries setting in the docs? It sounds like you can configure the entity extractor that way.
Thank you for the reply @koaning. I set use_word_boundaries to True for the DIETClassifier in my pipeline, but that didn’t seem to improve the accuracy for my entity extraction. My pipeline is as follows:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: DIETClassifier
I could be wrong, but by setting use_word_boundaries to “false” you’ll be able to detect “Business Name” as a single entity, but this will also cause “Business Name, LLC” to be detected as a single entity.
I’m wondering, would it perhaps make sense to have one Regex extractor work across words and another one for detecting terms like “LLC”?
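To make the word-boundary discussion concrete, here is a small Python sketch of how a lookup-table regex behaves with and without boundaries. It is not Rasa's implementation; it only assumes that the setting wraps each lookup entry in `\b` anchors, and `lookup_regex` and `find_entries` are hypothetical helper names. It suggests that a comma inside an entry is harmless in either mode, while an entry that ends in `)` can fail to match when boundaries are on, because `\b` needs a word character on one side.

```python
import re

def lookup_regex(entries, use_word_boundaries):
    """Build one alternation from lookup entries (hypothetical helper)."""
    escaped = [re.escape(e) for e in entries]
    if use_word_boundaries:
        # Assumption: each entry is wrapped in \b ... \b.
        escaped = [r"\b" + e + r"\b" for e in escaped]
    return re.compile("(" + "|".join(escaped) + ")")

def find_entries(text, entries, use_word_boundaries):
    """Return every lookup entry found in the text, in order."""
    pattern = lookup_regex(entries, use_word_boundaries)
    return [m.group(0) for m in pattern.finditer(text)]

entries = ["Business Name, LLC (subdivision ab)", "Business Name, LLC"]
text = "Find Business Name, LLC (subdivision ab) records"

# With boundaries, the trailing \b after ")" cannot match before a space,
# so only the shorter entry is found.
with_boundaries = find_entries(text, entries, use_word_boundaries=True)

# Without boundaries, the full entry is matched as one span.
without_boundaries = find_entries(text, entries, use_word_boundaries=False)
```

The flip side of turning boundaries off is that short entries such as “LLC” would also match inside longer words, which is the trade-off to weigh before changing the setting.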
I can see that approach working, but I am also concerned about the case where the user wants to query multiple companies with the same prefix, e.g., “Business Name, Business Name, LLC and Business Name (subdivision ab).” I’d imagine there would be a lot of overhead in my Custom Actions when parsing these entities to send to another endpoint to make sure all entities are included in the query, correct?
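For the shared-prefix case, one option inside a custom action is to resolve the raw message text against the lookup values directly, trying longer names first so that “Business Name” does not swallow the longer variants. The sketch below is plain Python, not a Rasa API; `COMPANIES`, `build_pattern` and `find_companies` are hypothetical names, and the bare “Business Name” and “Business Name (subdivision ab)” entries are assumed additions to the lookup table to mirror the example sentence.

```python
import re

# Hypothetical lookup values; the last two are assumed for illustration.
COMPANIES = [
    "Cool Tech",
    "Corporation Inc.",
    "Business Name, LLC",
    "Business Name, LLC (subdivision ab)",
    "Business Name",
    "Business Name (subdivision ab)",
]

def build_pattern(names):
    """Compile one alternation with the longest names tried first."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile("|".join(re.escape(n) for n in ordered), re.IGNORECASE)

def find_companies(text, names=COMPANIES):
    """Return the lookup values mentioned in the text, in order of appearance."""
    pattern = build_pattern(names)
    # Map case-insensitive matches back to the spelling stored in the table.
    canonical = {n.lower(): n for n in names}
    return [canonical[m.group(0).lower()] for m in pattern.finditer(text)]

message = "Business Name, Business Name, LLC and Business Name (subdivision ab)."
companies = find_companies(message)
```

Because each match is mapped back to the exact lookup spelling, the resulting list can be passed to the database query as-is, with commas preserved.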
It depends a bit on how you fetch the entities, yeah. A spaCy model may help out here, but it may be overkill to finetune your own model for this task. There’s an online demo for their small models here; their large (lg) models tend to perform a fair bit better.