CRFEntityExtractor and large lookup tables

Hello (btw, I just joined the forum). At the moment I have a lookup table with about 10,000 names.

Finding every name entity works fine with RegexEntityExtractor, but I would like to use groups and/or roles of the entities too. For that I need CRFEntityExtractor. But finding all the names in the lookup table seems to be quite hard; quite a lot of training examples are needed in the nlu.yml file.

Names like “Leonardo da Vinci” are sometimes split into two person entities, but adding several training examples with names like that seems to help a lot.

I will probably generate more names in the future, which means even more training examples. It seems a bit uncertain whether the CRFEntityExtractor will find all the names from the lookup table.

The reason for the entity groups is that the bot should handle questions like this: “When were Marie Curie and Albert Einstein born and when did Leonardo da Vinci die?”

Marie Curie and Albert Einstein would have the group “born” and Leonardo da Vinci the group “died”. Then it would be easy to retrieve the birth dates and the death date for the right people in actions.py.
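To make the idea concrete, here is a minimal sketch of how the grouped entities could be sorted out on the action side. It assumes each extracted entity arrives as a dict with `entity`, `value` and `group` keys (the shape I would expect from `tracker.latest_message["entities"]` in a custom action); the helper name `names_by_group` and the sample list are made up for illustration.

```python
from collections import defaultdict


def names_by_group(entities, entity_type="person"):
    """Collect the values of one entity type, keyed by their group label."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent.get("entity") != entity_type:
            continue
        # Entities that carry no group end up under the key None.
        grouped[ent.get("group")].append(ent["value"])
    return dict(grouped)


# Hand-written stand-in for the entities of the example question
# (assumption: this mirrors what the extractor would return).
entities = [
    {"entity": "person", "value": "Marie Curie", "group": "born"},
    {"entity": "person", "value": "Albert Einstein", "group": "born"},
    {"entity": "person", "value": "Leonardo da Vinci", "group": "died"},
]

grouped = names_by_group(entities)
print(grouped["born"])
print(grouped["died"])
```

With the names separated like this, the action only has to look up a birth date for everything under "born" and a death date for everything under "died".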

I would like to ask: is there any particular philosophy behind writing training examples so that CRFEntityExtractor finds all the names in a large lookup table?

If in the future I have, say, 70,000 full names in the lookup table(s), is there any good strategy for the training examples?
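One approach I could imagine, instead of hand-writing examples, is to generate them from a sample of the lookup table and a few sentence templates. The sketch below is only an illustration of that idea, not a recommended recipe: the templates, the helper names `annotate` and `make_examples`, and the entity name `person` are my own assumptions. It emits lines in the bracket-plus-JSON annotation style used in nlu.yml.

```python
import json
import random
import re

# Hypothetical templates; {born} and {died} mark slots for a name with that group.
TEMPLATES = [
    "When was {born} born?",
    "When did {died} die?",
    "When were {born} and {born} born and when did {died} die?",
]


def annotate(name, group):
    """Wrap a name in the [text]{...} entity annotation with a group."""
    meta = json.dumps({"entity": "person", "group": group})
    return f"[{name}]{meta}"


def make_examples(names, templates, n, seed=0):
    """Build n annotated example sentences from randomly sampled names."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    examples = []
    for _ in range(n):
        template = rng.choice(templates)
        # Fill every slot with a freshly sampled, annotated name.
        line = re.sub(
            r"\{(born|died)\}",
            lambda m: annotate(rng.choice(names), m.group(1)),
            template,
        )
        examples.append(line)
    return examples


sample_names = ["Marie Curie", "Albert Einstein", "Leonardo da Vinci"]
for line in make_examples(sample_names, TEMPLATES, 3):
    print("- " + line)
```

Sampling names with different shapes (two words, three words, lowercase particles like "da") would be a way to cover the multi-word cases mentioned above without listing every name from the table.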

It is better to use a pre-trained model if one is available. This avoids the need to train your own model every time a change is made. spaCy does a good job of extracting names, so I suggest using that as a base. If some names are missed, add them either as examples for DIET or as lookups. If you want to restrict extraction to specific names, look up the extracted entities in an in-memory dictionary.
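The in-memory dictionary step could look roughly like the following sketch. The function names and the normalization rule (case-insensitive, whitespace collapsed) are assumptions chosen for illustration; the extracted values would come from whichever extractor is in the pipeline.

```python
def normalize(name):
    """Case-insensitive key with runs of whitespace collapsed to one space."""
    return " ".join(name.casefold().split())


def build_index(names):
    """Map each normalized name to its canonical spelling."""
    return {normalize(name): name for name in names}


def restrict_to_known(extracted, index):
    """Keep only extracted values that are in the index, in canonical form."""
    known = []
    for value in extracted:
        key = normalize(value)
        if key in index:
            known.append(index[key])
    return known


index = build_index(["Marie Curie", "Albert Einstein", "Leonardo da Vinci"])
print(restrict_to_known(["marie  curie", "John Smith"], index))
```

A dict lookup like this stays fast even with tens of thousands of names, and it can be rebuilt from the name list without retraining anything.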

spaCy seemed to extract the names correctly, but I didn’t get the entity groups when using spaCy. I’m a beginner, especially with the pipeline settings in config.yml, so I don’t know if I did something wrong. :slight_smile:

According to the documentation only the CRFEntityExtractor and DIETClassifier can extract entities with groups and roles.

To my first post I would like to add a question: does the following help the CRFEntityExtractor find the names from a lookup table?

[nlu.yml]
- regex: PERSON
  examples: |
    - ^[\w'\-,.][^0-9_!¡?÷?¿/\\+=@#$%ˆ&*(){}|~<>;:[\]]{2,}$

I found the above regex pattern on Stack Overflow. If a regex pattern can additionally help the CRFEntityExtractor find the names from a lookup table, can anyone suggest a better-working pattern?
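A quick way to see what this pattern actually accepts is to try it outside Rasa with Python's `re` module. The check below only shows plain regex behaviour and makes no claim about how the featurizer applies it; the test strings are my own. Because of the `^` and `$` anchors the pattern judges a whole string, so it cannot pick a name out of a longer question, and it accepts any text that happens to contain no digits or listed symbols.

```python
import re

# The pattern from the nlu.yml snippet above, unchanged.
PERSON = re.compile(r"^[\w'\-,.][^0-9_!¡?÷?¿/\\+=@#$%ˆ&*(){}|~<>;:[\]]{2,}$")

tests = [
    "Leonardo da Vinci",           # a bare name: matches
    "When was Marie Curie born?",  # name inside a question: no match ("?" is excluded)
    "When was Marie Curie born",   # no excluded characters: the whole sentence matches
]

for text in tests:
    print(repr(text), bool(PERSON.search(text)))
```

So as a "this looks like a name" signal inside a sentence, the pattern probably says little; something unanchored and narrower (for example, runs of capitalized words) might be a more informative feature, though that is a guess on my part.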

Because the RegexEntityExtractor seems to find the entities from lookup tables with 100% accuracy, I wonder whether it is possible to use both RegexEntityExtractor and CRFEntityExtractor so that RegexEntityExtractor “gives” data to CRFEntityExtractor, and the CRFEntityExtractor then returns both the name entities and the groups related to them. (This question probably proves that I’m just a beginner with pipelines :slight_smile: )

Perhaps instead of groups you could use intents, e.g. one intent to ask about birth and another about death; or use the entities to look up data on the person and then apply a question-answering model to extract the answer from that data.