Pipelining NER

My bot has some trouble recognizing first names and surnames. An added complication is that I use forms, so I want to extract entities from single-word responses, e.g. “First name?”, “Adam”. For some reason this latter case does not work well (sometimes the entity is not extracted, sometimes the intent is wrongly recognized as a detour from the happy path), although it would seem trivial.

Since I work in Polish, I know that there is quite a good NER model in spaCy, which includes recognizing names. The problem is that it does not distinguish between first names and last names and just returns a single entity person_name (e.g. “Adam Smith”). I was thinking that the best solution would be to put the spaCy NER in the pipeline before the CRF extractor, in the hope that the CRF would use the output of the former as a feature when deciding. Is that possible?

BTW: Can I block detours from happy paths in any way?

Currently, this is not implemented, so you cannot use the output of one entity extractor as features for another one. However, you could write your own custom component that adapts the SpacyEntityExtractor to do exactly that.
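The core of such a custom component can be sketched without any Rasa-specific code. What follows is a minimal illustration in plain Python, not the Rasa component API: it assumes the spaCy extractor has already returned character spans for person names, and it flags every token overlapping one of those spans, so that a CRF could consume the flag as an extra per-token feature. The function names and the feature key `in_spacy_person` are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Split on whitespace, keeping (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def person_span_features(
    text: str, person_spans: List[Tuple[int, int]]
) -> List[Dict[str, object]]:
    """Flag each token that overlaps a person span found by an upstream NER."""
    features = []
    for token, start, end in tokenize_with_offsets(text):
        # Overlap test, so trailing punctuation on a token does not hide it
        inside = any(start < span_end and span_start < end
                     for span_start, span_end in person_spans)
        features.append({"token": token, "in_spacy_person": inside})
    return features


text = "Nazywam się Adam Smith"
# Assumed upstream output: one person_name entity covering "Adam Smith"
name_start = text.index("Adam Smith")
spans = [(name_start, name_start + len("Adam Smith"))]

for feature in person_span_features(text, spans):
    print(feature)
```

The flag alone does not separate first names from surnames. It only tells the CRF that the token is part of a person name, leaving the first/last distinction to be learned from position and the other features.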

Another idea that might help is lookup tables (see Training Data Format). If a first name is present in that list, the corresponding feature “present in first name lookup table” is set for it. That might help when training your CRF. Apart from that, you should make sure your training data contains a couple of examples that consist of only a first name.
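To make the lookup table feature concrete, here is a rough sketch in plain Python of how a list of names can be turned into one regular expression and then into a per-token yes/no feature. It is only an illustration of the idea and not Rasa's actual implementation. The helper names and the sample list of first names are made up for the example.

```python
import re
from typing import List


def lookup_table_pattern(entries: List[str]) -> re.Pattern:
    """Compile all lookup table entries into a single alternation regex."""
    # Longest entries first, so a longer name wins over one of its prefixes
    escaped = [re.escape(e) for e in sorted(entries, key=len, reverse=True)]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def lookup_features(tokens: List[str], pattern: re.Pattern) -> List[bool]:
    """Return True for each token that is an entry of the lookup table."""
    return [pattern.fullmatch(token) is not None for token in tokens]


# Hypothetical lookup table of first names
first_names = ["Adam", "Anna", "Zofia"]
pattern = lookup_table_pattern(first_names)

# A single-word form response and a full sentence
print(lookup_features(["Adam"], pattern))
print(lookup_features(["Mam", "na", "imię", "adam"], pattern))
```

Because the feature fires on the token itself, it is also available for single-word answers such as “Adam”, where there is no surrounding context for the CRF to rely on.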

Thanks. In the end, lookup tables with the right amount of data seem to work well. Am I right in assuming that I don't need to list the lookup tables in the features for the CRF entity extractor in config.yml? At the moment there is no explicit feature list there.

The lookup tables are processed by the RegexFeaturizer, which generates features from them. So you should have this component in your pipeline before the CRFEntityExtractor. You don’t need to change any config parameter of the entity extractor.
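The ordering requirement can be checked with a few lines of Python. The sketch below assumes the pipeline section of config.yml has already been parsed into a list of dictionaries with a name key. The WhitespaceTokenizer entry is only an example, and the helper function is hypothetical, not part of Rasa.

```python
from typing import Dict, List


def featurizer_precedes_extractor(
    pipeline: List[Dict[str, str]],
    featurizer: str = "RegexFeaturizer",
    extractor: str = "CRFEntityExtractor",
) -> bool:
    """True only if both components are present and the featurizer comes first."""
    names = [component["name"] for component in pipeline]
    if featurizer not in names or extractor not in names:
        return False
    return names.index(featurizer) < names.index(extractor)


# Assumed example pipeline, as it would look after parsing config.yml
pipeline = [
    {"name": "WhitespaceTokenizer"},
    {"name": "RegexFeaturizer"},
    {"name": "CRFEntityExtractor"},
]

print(featurizer_precedes_extractor(pipeline))
```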

Yes, I have the RegexFeaturizer. Thanks a bunch!