Entity Extraction for Phrases with Commas

Is there a way to extract entities for phrases that contain commas, other than regex entity extraction? I have a lookup table with a list of values (and plenty of training examples for each of them). A few of those values contain a comma in the middle, and I need those phrases to be preserved as entered when the entity is extracted, because the value is passed along to query a database.

For example, if I have a few company names in my lookup table:

  - lookup:
    examples: |
      - Cool Tech
      - Corporation Inc.
      - Business Name, LLC
      - Business Name, LLC (subdivision ab)

I want to be able to extract “Business Name, LLC” as a single entity rather than extracting “Business Name” and “LLC” as two separate entities with the same entity label.


Have you seen the use_word_boundaries setting in the docs? It sounds like you can configure the entity extractor that way.

Thank you for the reply @koaning. I set use_word_boundaries to True for the DIETClassifier in my pipeline, but that didn’t seem to improve the accuracy for my entity extraction. My pipeline is as follows:

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: DIETClassifier
    epochs: 100
    use_word_boundaries: True

I could be wrong, but by setting use_word_boundaries to False you’ll be able to detect “Business Name” as a single entity, but this will also cause “Business Name, LLC” to be detected as a single entity.
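To make the word-boundary idea concrete, here is a minimal sketch using plain Python `re`. It is not Rasa's internal implementation; the helper `find_value` and the sample sentence are made up for illustration. It only shows that a comma inside a lookup value is matched like any other literal character, and that the boundary check affects matches glued to other word characters. (As far as I know, the docs list `use_word_boundaries` on RegexFeaturizer and RegexEntityExtractor rather than on DIETClassifier, so it is worth double-checking where the option is set.)

```python
import re


def find_value(value: str, text: str, use_word_boundaries: bool) -> list:
    """Return every occurrence of a literal lookup value in text.

    Hypothetical helper for illustration; not part of Rasa.
    """
    pattern = re.escape(value)  # commas, dots, parentheses become literals
    if use_word_boundaries:
        # Reject matches that are directly attached to other word characters.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return re.findall(pattern, text, flags=re.IGNORECASE)


text = "Look up Business Name, LLC and LLCorp"

# With boundaries, "LLC" inside "LLCorp" is ignored.
print(find_value("LLC", text, use_word_boundaries=True))
# Without boundaries, the "LLC" prefix of "LLCorp" matches too.
print(find_value("LLC", text, use_word_boundaries=False))
# The full value, comma included, is found as one span.
print(find_value("Business Name, LLC", text, use_word_boundaries=True))
```

In this sketch the comma only matters if something upstream (for example a tokenizer) splits the phrase before matching happens.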

I’m wondering, would it perhaps make sense to have one Regex extractor work across words and another one for detecting terms like “LLC”?


I can see that approach working, but I am also concerned about the case where the user wants to query multiple companies with the same prefix, e.g., “Business Name, Business Name, LLC and Business Name (subdivision ab).” I’d imagine there would be a lot of overhead in my Custom Actions when parsing these entities before sending them to another endpoint, to make sure all entities are included in the query, correct?
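For the shared-prefix case, one option is a single longest-match-first pass over the raw message text. The sketch below is plain Python, not a Rasa API; the `LOOKUP` list, `build_pattern`, and `extract_companies` are hypothetical names, and the lookup values are assumed to mirror the example sentence above.

```python
import re

# Hypothetical lookup values, chosen to match the example sentence.
LOOKUP = [
    "Business Name",
    "Business Name, LLC",
    "Business Name (subdivision ab)",
    "Cool Tech",
]


def build_pattern(values):
    """Compile one alternation with the longest values first."""
    # Longest first, so "Business Name, LLC" wins over its prefix "Business Name".
    ordered = sorted(values, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    # The lookarounds keep matches from starting or ending inside another word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def extract_companies(text, values):
    """Return matched lookup values in the order they appear in text."""
    return [m.group(0) for m in build_pattern(values).finditer(text)]


message = "Business Name, Business Name, LLC and Business Name (subdivision ab)."
print(extract_companies(message, LOOKUP))
# ['Business Name', 'Business Name, LLC', 'Business Name (subdivision ab)']
```

If something like this ran inside a custom action, the parsing cost would be one regex scan per message, and the resulting list could be passed to the query as is. Whether that is acceptable depends on how large the lookup table is.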

It depends a bit on how you fetch the entities, yeah. A spaCy model may help out here, but it may be overkill to fine-tune your own model for this task. There’s an online demo for their small models here; their large (lg) models tend to perform a fair bit better.