How can I use only a selected part of the training data for the CRF model?

I’m exploring an approach where ner_spacy is my general entity extractor and ner_crf is trained only for the special cases where ner_spacy performs badly. A logistic regression on top of ner_spacy and ner_crf would then pick the right entity.

My problem is: how can I tag the part of the training data that I want ner_crf to use? I tried adding the extractor field, with the corresponding values ner_spacy / ner_crf, but then realised that the function filter_trainable_entities only removes the entities, not the whole training example. The CRF still sees those sentences, now without their labels, which just confuses the model.
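To make the difference concrete, here is a minimal sketch in plain Python. It does not call Rasa; `strip_foreign_entities` is only a simplified stand-in for the behaviour I described for filter_trainable_entities, and `drop_foreign_examples` is the behaviour I actually want. Both function names and the toy sentences are made up for illustration.

```python
from copy import deepcopy

# Toy examples shaped like Rasa NLU JSON training data (texts are invented).
EXAMPLES = [
    {"text": "fly to Berlin",
     "entities": [{"start": 7, "end": 13, "value": "Berlin",
                   "entity": "city", "extractor": "ner_spacy"}]},
    {"text": "order part XJ-12",
     "entities": [{"start": 11, "end": 16, "value": "XJ-12",
                   "entity": "part_id", "extractor": "ner_crf"}]},
]


def strip_foreign_entities(examples, extractor_name):
    """Keep every example, but remove entities tagged for another extractor.

    Simplified stand-in for the filtering described above: entities without
    an "extractor" field are kept, the sentence itself always stays.
    """
    result = deepcopy(examples)
    for example in result:
        example["entities"] = [
            entity for entity in example.get("entities", [])
            if entity.get("extractor", extractor_name) == extractor_name
        ]
    return result


def drop_foreign_examples(examples, extractor_name):
    """Keep only examples with at least one entity tagged for this extractor."""
    return [
        deepcopy(example) for example in examples
        if any(entity.get("extractor") == extractor_name
               for entity in example.get("entities", []))
    ]


stripped = strip_foreign_entities(EXAMPLES, "ner_crf")
dropped = drop_foreign_examples(EXAMPLES, "ner_crf")
print(len(stripped), len(dropped))
```

With the first function the CRF would still train on "fly to Berlin" as a sentence containing no entities; with the second it never sees that sentence at all.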

I tried adding a new field to the specialized messages, specialized_crf: True, and subclassed the CRFEntityExtractor so that its training function selects only the training data with that field set to True. But it seems that the specialized_crf field gets deleted somewhere in the pipeline.

Any ideas how I can tag part of the data and have my subclassed CRF entity extractor filter by it?

One viable approach that I’m going with for now is to pass a config parameter to my subclassed CRF, containing the path to the data I want to use. Any other ideas are welcome.
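Roughly, the config-path workaround looks like the sketch below. It is plain Python with no Rasa imports, so the wiring into the subclass is left out. The config key `crf_training_data` and both helper names are hypothetical; the only thing taken from Rasa is the JSON layout of the training file (`rasa_nlu_data` / `common_examples`).

```python
import json


def load_common_examples(path):
    """Read the example list from a Rasa NLU JSON training file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data["rasa_nlu_data"]["common_examples"]


def examples_for_crf(component_config, pipeline_examples):
    """Pick the examples the specialised CRF should train on.

    If the (hypothetical) "crf_training_data" key is set, load that file;
    otherwise fall back to the examples the pipeline passed in.
    """
    path = component_config.get("crf_training_data")
    if path is None:
        return pipeline_examples
    return load_common_examples(path)
```

The subclass would call something like `examples_for_crf` at the start of its training function and train on the returned examples instead of the full training data. The downside is that the specialised data lives in a separate file, so it has to be kept in sync with the main training data by hand.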

PS. Why do all my forum posts appear with a grey title? I see a few others with grey titles as well, but all of mine are. I just don’t understand what it means.

All the entities you label in your training data are only used by ner_crf, never by ner_spacy. ner_spacy relies on spaCy’s pretrained entity recognizer and is not trained on your data.

As for the grey title, I think that just means you’ve looked at the post before :)