Question about entity extraction on lookup table

Hi guys! I have a question about entity extraction with a lookup table:

I implemented a lookup table with a number of company names. Some of them have the form “A & B” or “C, D, LLC”. I can now use RegexFeaturizer and CRFEntityExtractor to identify them. However, it picks up “A”, “&”, “B” and “C”, “D”, “LLC” as separate entities.

Is there a way to configure the pipeline so that Rasa picks up “A & B” and “C, D, LLC” as single entities?

Also, do you guys have any recommendations or advice on tuning the pipeline to achieve better performance?

Thanks in advance!

Hi @BrianYing - how are you annotating these entities? If you annotate them as e.g. company [C D LLC](company), then the whole company name should be recognized as one entity.
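To make the annotation point concrete, here is a minimal Python sketch of how the `[value](entity)` markup maps to a single character span covering the whole company name. The `parse_annotated` helper and the example sentence are hypothetical and written only for illustration; this is not Rasa's own training data parser.

```python
import re

# Matches the [value](entity) markup used in the annotated training examples.
ANNOTATION = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")

def parse_annotated(example):
    """Strip the markup and return the plain text plus one span per annotation."""
    text, entities, last = "", [], 0
    for m in ANNOTATION.finditer(example):
        text += example[last:m.start()]      # copy the text before the annotation
        start = len(text)
        text += m.group("value")             # keep only the entity value
        entities.append({
            "start": start,
            "end": len(text),
            "value": m.group("value"),
            "entity": m.group("entity"),
        })
        last = m.end()
    text += example[last:]                   # copy whatever follows the last annotation
    return text, entities

text, entities = parse_annotated("I work for [C, D, LLC](company) in Boston")
print(text)
print(entities)
```

The point is that the commas and spaces sit inside one span, so the extractor is trained to tag “C”, “D” and “LLC” as parts of the same entity.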

The lookup table just provides extra features to help the CRF figure out where entities are. It doesn’t behave as a hard lookup.
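A rough Python sketch of what "extra features, not a hard lookup" means: the lookup entries are combined into one regular expression, and every token that overlaps a match gets a boolean flag. The names `lookup_pattern`, `whitespace_tokens` and `pattern_flags` are made up for this illustration, and Rasa's actual RegexFeaturizer may differ in its details.

```python
import re

# Lookup entries taken from the examples in this thread.
LOOKUP = ["A & B", "C, D, LLC"]

def lookup_pattern(entries):
    """Combine lookup entries into one case-insensitive alternation."""
    # Longest entries first, so a longer name wins over a shorter one it contains.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    # Lookarounds keep a match from starting or ending in the middle of a word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def whitespace_tokens(text):
    """Return (token, start, end) triples, splitting on whitespace only."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def pattern_flags(text, entries):
    """Flag each token that overlaps a lookup match. This is a feature, not a label."""
    spans = [m.span() for m in lookup_pattern(entries).finditer(text)]
    return [
        (tok, any(start < s_end and end > s_start for s_start, s_end in spans))
        for tok, start, end in whitespace_tokens(text)
    ]

flags = pattern_flags("I work at A & B in Boston", LOOKUP)
print(flags)
```

Each of “A”, “&” and “B” is flagged individually. It is still up to the CRF, trained on the annotated examples, to learn that a run of flagged tokens forms one `company` entity.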

I see. I do annotate entities in the way you described. However, not all company names appear in the training texts. So “A & B”, which is in the lookup table as a company but not in the training data, was not picked up correctly. Is there any way to solve this type of issue?

Thanks!

I’m not sure I’ve understood - so “A & B” is in your lookup table but not in your training examples, correct? That should still work.

To be quite sure, you can also remove the "low" feature from the CRF so that it doesn’t see the words themselves at all (so it has to pay attention to the lookup table feature).

Something like:

language: "en"

pipeline:
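- name: "WhitespaceTokenizer"
# RegexFeaturizer turns lookup tables into the "pattern" feature used below
- name: "RegexFeaturizer"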
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"
- name: "CRFEntityExtractor"
  features:
    # features for word before token
    - ["low", "title", "upper", "digit"]
    # features of token itself
    - ["bias", "upper", "title", "digit", "pattern"]
    # features for word after the token we want to tag
    - ["low", "title", "upper", "digit"]

(Adjust this depending on which other pipeline components you’re using.)
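As a hedged illustration of what the three feature lists control, the Python sketch below builds the feature dictionary for one token from a window of previous token, token itself and next token. The `EXTRACTORS` table, the `token_features` helper and the `offset:name` key format are assumptions made for this example and are not Rasa's internal code.

```python
# Same layout as the config above: [word before, token itself, word after].
FEATURES = [
    ["low", "title", "upper", "digit"],
    ["bias", "upper", "title", "digit", "pattern"],  # no "low" for the token itself
    ["low", "title", "upper", "digit"],
]

# Hypothetical feature functions: (token text, lookup flag) -> feature value.
EXTRACTORS = {
    "low": lambda tok, in_lookup: tok.lower(),
    "title": lambda tok, in_lookup: tok.istitle(),
    "upper": lambda tok, in_lookup: tok.isupper(),
    "digit": lambda tok, in_lookup: tok.isdigit(),
    "bias": lambda tok, in_lookup: "bias",
    "pattern": lambda tok, in_lookup: in_lookup,
}

def token_features(tokens, lookup_flags, i, config=FEATURES):
    """Collect features for token i from its neighbours and itself."""
    feats = {}
    half = len(config) // 2
    for offset, names in zip(range(-half, half + 1), config):
        j = i + offset
        if 0 <= j < len(tokens):             # skip positions outside the sentence
            for name in names:
                feats[f"{offset}:{name}"] = EXTRACTORS[name](tokens[j], lookup_flags[j])
    return feats

tokens = ["at", "A", "&", "B"]
lookup_flags = [False, True, True, True]     # e.g. produced by a lookup table match
feats = token_features(tokens, lookup_flags, 1)
print(feats)
```

With this configuration the token “A” contributes only its shape features and the lookup flag, while the neighbouring words still contribute their lowercased text through the first and third lists.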

Yes, your understanding is correct. I have training examples for some of the company names. However, since I have a large number of names, I put them into the lookup table.

Do you mean remove all “low” features or just the one in the middle (features of the token itself)?

Thanks!