NER_CRF model is not generalizing

abhi_bh_nlp · November 13, 2019, 11:00pm

Hi, below is my config file:

config = """
language: "en"

pipeline:
- name: "SpacyNLP"                 # loads the spacy language model
  model: "en_core_web_md"
  case_sensitive: false
- name: "SpacyTokenizer"           # splits the sentence into tokens
- name: "SpacyFeaturizer"          # transforms the sentence into a vector representation
- name: "SklearnIntentClassifier"  # uses the vector representation to classify using SVM
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
  features: [
    ["low", "title", "upper", "prefix2", "suffix2"],
    ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "upper", "title", "digit", "pattern"],
    ["low", "title", "upper", "prefix2", "suffix2"]
  ]
- name: "EntitySynonymMapper"      # trains the synonyms
"""

I have now provided almost 300 training examples across my intents. NER identifies the entities I have already trained on, but it does not generalize to entities that are not in the training data. For example:

Let's say I have training data such as:

How do i get number of issue for ABC
How do i get number of students in XYZ
How many number of students are in NOP

I have almost 300 training examples like these. The model predicts [ABC], [XYZ] and [NOP], but it doesn't recognize [DEF] as dept. I don't think the answer to this query should be "include more training examples", because the CRF model in CoreNLP starts generalizing to such entities with far less training data.

P.S.: These are just sample names I have used. Can anyone share any thoughts on this? @akelad @erohmensing
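One way to see why an unseen name is hard for this feature set is to compare the token-level features of a seen and an unseen name. The sketch below is a simplified approximation of the features named in the config (low, title, upper, digit, prefix/suffix), not Rasa's actual implementation; the helper names `token_features` and `shared_features` are made up for illustration.

```python
def token_features(token: str) -> dict:
    """Approximate the per-token features named in the config (assumed behaviour)."""
    return {
        "low": token.lower(),
        "title": token.istitle(),
        "upper": token.isupper(),
        "digit": token.isdigit(),
        "prefix2": token[:2],
        "prefix5": token[:5],
        "suffix2": token[-2:],
        "suffix3": token[-3:],
        "suffix5": token[-5:],
    }


def shared_features(seen: str, unseen: str) -> list:
    """Return the feature names whose values are identical for both tokens."""
    a, b = token_features(seen), token_features(unseen)
    return sorted(name for name in a if a[name] == b[name])


# Only the shape features carry over from a seen name to an unseen one.
print(shared_features("ABC", "DEF"))  # ['digit', 'title', 'upper']
```

Under these assumptions, the identity-like features (low, prefixes, suffixes) carry nothing over to a new name, so the model has to rely on the shape features and on the neighbouring words in the window.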

akelad · November 27, 2019, 12:33pm

@abhi_bh_nlp have you tried using a regex for this?

abhi_bh_nlp · November 30, 2019, 4:34am

Well, the questions I posted are very simplistic. The real questions look like this: What is the count of high risk issues that are due in less than 30 days for banking and financial channel

I have to identify all of these named entities. The problem is that the dept name, and most of the other named entities, are not generalizing: they are only predicted if the entity is present in the training data. The dept name could come from a list of 100 or 1000 values, I don't know. I am not sure regex is an option here.

CoreNLP starts generalizing with a lot fewer data points. I am not sure why this NER is not. @akelad

akelad · December 2, 2019, 1:09pm

Ah sorry, I misread your original post and thought the entities were in formats like XYZ, e.g. issue numbers or something.

So for the department name, I would suggest using lookup tables, since that's likely a closed list. You can also try experimenting a bit with the features you use. I see you're using different features from the default ones: rasa/crf_entity_extractor.py at master · RasaHQ/rasa · GitHub
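A minimal sketch of the lookup table suggestion, assuming the Rasa 1.x Markdown training data format (`## intent:` and `## lookup:` sections). The intent name `count_query` and the helper `build_nlu_markdown` are hypothetical; the entity name `dept` and the sample values come from the posts above.

```python
def build_nlu_markdown(intent, examples, lookup_name, lookup_values):
    """Build Markdown NLU training data with one intent and one lookup table."""
    lines = [f"## intent:{intent}"]
    lines += [f"- {example}" for example in examples]
    lines.append("")
    lines.append(f"## lookup:{lookup_name}")
    lines += [f"- {value}" for value in lookup_values]
    return "\n".join(lines) + "\n"


nlu_md = build_nlu_markdown(
    intent="count_query",  # hypothetical intent name
    examples=[
        "How do i get number of issue for [ABC](dept)",
        "How do i get number of students in [XYZ](dept)",
        "How many number of students are in [NOP](dept)",
    ],
    lookup_name="dept",
    # The lookup table may list names that never appear in the examples.
    lookup_values=["ABC", "XYZ", "NOP", "DEF"],
)
print(nlu_md)
```

As far as I know, lookup tables only take effect when RegexFeaturizer is in the pipeline and the CRF feature list includes "pattern", both of which the config above already has. The annotated examples should still contain a few of the lookup values so the model can learn to use the pattern feature.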
