Leveraging both spaCy and CRF entity extraction correctly

Yes, there are definitely frequent ambiguities in NER… that’s what makes it an interesting problem :wink:

I don’t agree with going down the route of fine-tuning the spaCy NER or training it from scratch: that really requires a lot of data (if I remember correctly, ~5000 examples per entity type according to the creator). Moreover, if you start with the pre-trained model and want to avoid catastrophic forgetting, you need large amounts of pretty general training data in addition to the new case-specific data.

The CRF, on the other hand, doesn’t seem to need a lot of training data. The question is just whether to train it with data annotated as LOC_crf and PERSON_crf and then let Core deal with the implications of having two different entities that really refer to a “location”, and thus should have a similar effect on the dialogue flow — or to use LOC and PERSON and have a separate NLU component that decides whether to go with the spaCy or the CRF prediction (probably mostly based on the CRF confidence value). Since I’m currently not using Core, I prefer the latter.
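
To make that second option concrete, here’s a rough sketch of the resolution step. The function name, the threshold, and the exact merge policy are just my assumptions, not anything Rasa provides out of the box; the entity dicts are assumed to be in the usual NLU shape (`start`, `end`, `value`, `entity`, plus `confidence` for the CRF):

```python
# Sketch: merge overlapping spaCy and CRF entity predictions, preferring
# a CRF span when its confidence is high enough.
# The threshold and function names are illustrative, not a library API.

CRF_CONFIDENCE_THRESHOLD = 0.7  # assumption; tune on your own data


def overlaps(a: dict, b: dict) -> bool:
    """True if the two entity spans overlap in character offsets."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def resolve_entities(spacy_entities: list, crf_entities: list) -> list:
    """Combine both extractors' predictions into one entity list."""
    resolved = []

    # Accept a CRF span if it is confident, or if spaCy has nothing there.
    for crf_ent in crf_entities:
        conflicting = [e for e in spacy_entities if overlaps(e, crf_ent)]
        if crf_ent.get("confidence", 0.0) >= CRF_CONFIDENCE_THRESHOLD or not conflicting:
            resolved.append(crf_ent)

    # Keep spaCy entities that are not covered by an accepted CRF span.
    for spacy_ent in spacy_entities:
        if not any(overlaps(spacy_ent, e) for e in resolved):
            resolved.append(spacy_ent)

    return sorted(resolved, key=lambda e: e["start"])
```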

I think phrase matching is most useful if you have a fixed list of case-specific, non-generic entities that you want to extract? I haven’t really used it so far, though.
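
For what it’s worth, if by phrase matching you mean spaCy’s PhraseMatcher, a minimal example would look roughly like this (the label and terms are made up, and the `matcher.add` signature differs slightly between older and newer spaCy versions):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# Made-up, case-specific terms a statistical NER model would likely miss.
terms = ["acme headquarters", "building 7b"]
matcher.add("CUSTOM_LOC", [nlp.make_doc(t) for t in terms])

doc = nlp("Please ship the package to Building 7B.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```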
