Question about entity extraction on lookup table

BrianYing · June 19, 2019, 3:08am

Hi guys! I have a question about entity extraction on lookup table:

I implemented a lookup table with a number of company names, some of which have the form “A & B” or “C, D, LLC”. I can now use RegexFeaturizer and CRFEntityExtractor to identify them. However, the extractor picks up “A”, “&”, “B” and “C”, “D”, “LLC” as separate entities.

Is there a way to configure the pipeline so that Rasa picks up “A & B” and “C D LLC” as single entities?

Also, do you guys have any recommendations or advice on tuning the pipeline to achieve better performance?

Thanks in advance!

amn41 · June 23, 2019, 2:55pm

Hi @BrianYing - how are you annotating these entities? If you annotate them as e.g. company [C D LLC](company), then the whole company name should be recognized as a single entity.

The lookup table just provides extra features to help the CRF figure out where entities are; it doesn’t behave as a hard lookup.
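
To make the "extra features" point concrete, here is a rough Python sketch of the idea. It is an illustration only, not Rasa's actual implementation: the function names (`build_lookup_regex`, `whitespace_tokens`, `pattern_flags`) are made up for this example, and it assumes whitespace tokenization and a single case-insensitive regex built from the lookup entries.

```python
import re

def build_lookup_regex(names):
    """Combine lookup-table entries into one case-insensitive regex (hypothetical helper)."""
    # Try longer names first so that they win over shorter overlapping ones.
    escaped = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"(?i)(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")

def whitespace_tokens(text):
    """Split on whitespace, keeping character offsets for each token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def pattern_flags(text, regex):
    """Return (token, flag) pairs; flag is True if the token overlaps a lookup match."""
    matches = [(m.start(), m.end()) for m in regex.finditer(text)]
    flags = []
    for tok, start, end in whitespace_tokens(text):
        hit = any(start < m_end and end > m_start for m_start, m_end in matches)
        flags.append((tok, hit))
    return flags

regex = build_lookup_regex(["A & B", "C, D, LLC"])
print(pattern_flags("I work at A & B now", regex))
```

In this sketch the tokens “A”, “&” and “B” each get their own True flag. Nothing in the flags says the three tokens belong together, so the CRF still has to learn from annotated multi-token examples where an entity starts and ends.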

BrianYing · June 24, 2019, 2:19pm

I see. I do annotate entities the way you described. However, not all of the company names appear in my training texts. For example, “A & B” is in the lookup table as a company but not in the training examples, and it was not picked up correctly. Is there any way to solve this type of issue?

Thanks!

amn41 · June 24, 2019, 3:22pm

I’m not sure I’ve understood - so “A & B” is in your lookup table but not in your training examples, correct? That should still work.

To be sure, you can also remove the "low" feature from the CRF so that it doesn’t see the words themselves at all (so it has to pay attention to the lookup table feature).

Something like:

language: "en"

pipeline:
- name: "WhitespaceTokenizer"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"
- name: "CRFEntityExtractor"
  features:
    # features for word before token
    - ["low", "title", "upper", "digit"]
    # features of token itself
    - ["bias", "upper", "title", "digit", "pattern"]
    # features for word after the token we want to tag
    - ["low", "title", "upper", "digit"]

(depending on which other pipeline components you’re using)

BrianYing · June 24, 2019, 3:47pm

Yes, your understanding is correct. I have training examples for some of the company names. However, since I have a large number of names, I put them into a lookup table.

Do you mean I should remove every “low”, or just the one in the middle (features of the token itself)?

Thanks!
