NER_CRF model is not generalizing

Hi, below is my config file:

    config = """
    language: "en"

    pipeline:
    - name: "SpacyNLP"                 # loads the spacy language model
      model: "en_core_web_md"
      case_sensitive: false
    - name: "SpacyTokenizer"           # splits the sentence into tokens
    - name: "SpacyFeaturizer"          # transform the sentence into a vector representation
    - name: "SklearnIntentClassifier"  # uses the vector representation to classify using SVM
    - name: "RegexFeaturizer"
    - name: "CRFEntityExtractor"
      features: [
        ["low", "title", "upper", "prefix2", "suffix2"],
        ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "upper", "title", "digit", "pattern"],
        ["low", "title", "upper", "prefix2", "suffix2"]
      ]
    - name: "EntitySynonymMapper"      # trains the synonyms
    """

I have now provided almost 300 training examples across my intents. The NER component identifies entities it has already seen in training, but it does not generalize to entities that are not in the training data. For example:

Let's say my training data looks like this:

  • How do i get number of issue for ABC
  • How do i get number of students in XYZ
  • How many number of students are in NOP

I have almost 300 training examples like these. The model predicts [ABC], [XYZ] and [NOP], but it doesn’t recognize [DEF] as a dept. I don’t think the answer to this should be “include more training examples”, because the CRF model in CoreNLP starts generalizing to such entities with far less training data.

P.S.: These are just sample names. Can anyone share any thoughts on this? @akelad @erohmensing

@abhi_bh_nlp have you tried using a regex for this?

Well, the questions I posted are very simplistic. The real questions look like this: What is the count of high risk issues that are due in less than 30 days for banking and financial channel

I have to identify all of these named entities. The problem is that the dept name, like most of the named entities, is not generalizing: it is only predicted if the entity is present in the training data. The dept name could come from a list of 100 or 1000 values, I don’t know. I am not sure regex is an option here.

CoreNLP starts generalizing with far fewer data points. I am not sure why this NER does not. @akelad
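One possible reading of this behaviour, offered as a guess rather than a diagnosis: most of the features configured for the current token (low, prefix5, prefix2, suffix5, suffix3, suffix2) encode the token's spelling, so with only a handful of distinct dept names the CRF may lean on them and effectively memorize those names. The sketch below is plain Python, not Rasa code; `token_features` is a hypothetical helper that imitates how such features are commonly computed, and it compares a seen name with an unseen one.

```python
def token_features(token):
    """Hypothetical stand-in for the current-token features listed in the config."""
    return {
        "bias": "bias",
        "low": token.lower(),
        "prefix5": token[:5],
        "prefix2": token[:2],
        "suffix5": token[-5:],
        "suffix3": token[-3:],
        "suffix2": token[-2:],
        "upper": token.isupper(),
        "title": token.istitle(),
        "digit": token.isdigit(),
    }


seen = token_features("ABC")    # appears in the training data
unseen = token_features("DEF")  # never seen during training

# Features whose values are identical for both tokens (shape-like features).
shared = sorted(k for k in seen if seen[k] == unseen[k])
# Features whose values differ (spelling-based features).
differing = sorted(k for k in seen if seen[k] != unseen[k])

print("shared:", shared)
print("differing:", differing)
```

Only the shape-like features (bias, digit, title, upper) carry over to the unseen name, so generalization would have to come from those and from the neighbouring-token features. More varied dept names in the training data, or fewer spelling-based features on the current token, might shift the weight toward context.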

Ah sorry, I misread your original post and thought the entities were in formats like XYZ, e.g. issue numbers or something.

So for the department name, I would suggest using lookup tables, since that’s likely a closed list. You can also try experimenting with the features you use. I see you’re using different features from the default ones: rasa/crf_entity_extractor.py at master · RasaHQ/rasa · GitHub
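To illustrate the lookup table idea: as far as I understand it, the RegexFeaturizer turns the entries of a lookup table into one case-insensitive regex, and the CRF then sees whether a token matched through the "pattern" feature that is already in the config above. The sketch below imitates that idea in plain Python. It is not Rasa's implementation; `build_lookup_regex`, `pattern_flags` and the `DEPARTMENTS` list are hypothetical names made up for this example.

```python
import re

# Hypothetical closed list of department names (would live in a lookup table).
DEPARTMENTS = ["ABC", "XYZ", "NOP", "DEF", "banking and financial"]


def build_lookup_regex(entries):
    """Combine all entries into one case-insensitive, word-bounded alternation."""
    # Longest entries first so multi-word names win over shorter overlaps.
    escaped = [re.escape(e) for e in sorted(entries, key=len, reverse=True)]
    return re.compile(r"(?i)(\b" + r"\b|\b".join(escaped) + r"\b)")


def pattern_flags(text, regex):
    """Return (token, matched) pairs for whitespace tokens, like a 'pattern' feature."""
    spans = [m.span() for m in regex.finditer(text)]
    flags = []
    for tok in re.finditer(r"\S+", text):
        # A token is flagged if it overlaps any lookup match.
        hit = any(tok.start() < end and tok.end() > start for start, end in spans)
        flags.append((tok.group(), hit))
    return flags


regex = build_lookup_regex(DEPARTMENTS)
simple = pattern_flags("How many number of students are in DEF", regex)
real = pattern_flags("high risk issues for banking and financial channel", regex)

print(simple)
print(real)
```

With a flag like this, a name such as DEF can be picked up even if it never occurs in a training sentence, as long as it is in the list and the training data contains enough examples where the flag coincides with a dept entity.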