Hey @community @tyd @stephens, I was looking for a simple and reliable way to do name recognition that works for both Indian names and international names.
Existing pipelines
Many of the existing pipelines are not very good at picking up name entities. spaCy works for names of US origin or English names, but Indian names are usually typed as a transliteration from the person's native language into English, so all of these pipelines fail to extract them.
My solution for getting the bot to recognise Indian names is to use a custom-trained CRFEntityExtractor together with SpacyEntityExtractor, so the pipeline is a composite of these two entity extractors. The config:
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: "en"
pipeline:
  - name: SpacyNLP
    model: "en_core_web_lg"
    case_sensitive: False
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CRFEntityExtractor
  - name: SpacyEntityExtractor
    dimensions: ["PERSON", "ORG"]
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 25
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 25

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 25
  - name: MappingPolicy
  - name: FormPolicy
  - name: "FallbackPolicy"
    nlu_threshold: 0.4
    core_threshold: 0.3
    fallback_action_name: "action_default_fallback"
Above is the config file I used in my project. With this alone you will still have problems recognising names, because neither the CRF nor the spaCy extractor is built for Indian names out of the box.
For that you need a good data set for Indian name entity recognition. This is the nlu.md file I used for my project:
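## intent:inform
- my name is [akshith](person_name)
- my name is [Chaitanya](person_name)
- my name is [damodar](person_name)
- my name is [ekani](person_name)
- my name is [fharan](person_name)
- my name is [ganesh](person_name)
- my name is [hyper](person_name)
- my name is [indra](person_name)
- my name is [Jaques](person_name)
- my name is [Jayanand](person_name)
- my name is [Kanta](person_name)
- my name is [Laksman](person_name)
- my name is [Madhukar](person_name)
- my name is [Nagesh](person_name)
- my name is [Om](person_name)
- my name is [Panduranga](person_name)
- my name is [Raju](person_name)
- my name is [Swami](person_name)
- my name is [bargav](person_name) he's
- my name is [surya](person_name) and i'm going to have a baby!
- my name is [harshith](person_name)
- my name is [harshith](person_name) he's
- my name's [rama chandra](person_name) your is ralph, i think
- my name is [durga](person_name) and i'm going to be honest with you
- my name's [priya](person_name) and i hate my a lot, you know?
- i have to tell you, my name's [harichandra](person_name)
- my name is [ramesh](person_name) i 'll be right back
- my name's [suresh](person_name) my name's coldblooded!
- my name's [ramchandra](person_name)
- my name is [shiva](person_name) and i'm very much looking to meet you
- i have a person_name's [niklesh](person_name)
- my name's [reddy](person_name) and i have no idea whose house i'm at,
- my name is [naveen namani](person_name)
- my name's [harshith](person_name)
- my name's [surya](person_name)
- my name's [harshith](person_name) i'm the one who taught him how to kill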
You may wonder why the data set repeats the same pattern under one intent:
- my name is -----------------
If you look closely, the names in the data set follow alphabetical order, from A to Z.
This is what trains the CRFEntityExtractor as a custom entity extractor: it learns to extract Indian names that follow the string "my name is …" and similar phrasings.
To cover the case where the user enters only their name, add training examples to the NLU data in the same way: random Indian names on their own, in alphabetical order, covering A to Z.
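Typing these examples by hand gets tedious, so here is a minimal sketch of how they could be generated. It assumes Rasa's Markdown training format, where an entity is annotated as `[value](entity_name)`. The function names `make_name_examples` and `missing_initials` and the sample names are my own illustration, not part of the original project.

```python
import string
from typing import Iterable, List, Set


def make_name_examples(names: Iterable[str], entity: str = "person_name") -> List[str]:
    """Build Markdown training lines: a 'my name is ...' line and a bare-name line per name."""
    lines = []
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip blank entries
        lines.append(f"- my name is [{name}]({entity})")
        lines.append(f"- [{name}]({entity})")
    return lines


def missing_initials(names: Iterable[str]) -> Set[str]:
    """Return the letters a-z that no name starts with, to check A to Z coverage."""
    seen = {n.strip()[0].lower() for n in names if n.strip()}
    return set(string.ascii_lowercase) - seen


if __name__ == "__main__":
    sample = ["akshith", "Chaitanya", "damodar"]  # hypothetical sample names
    print("\n".join(make_name_examples(sample)))
    print(sorted(missing_initials(sample)))
```

The second function is only a convenience for spotting which initials still need examples before you train.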
Still not sure about your bot? To boost your confidence level, add a lookup table with this data set: names.txt (463.4 KB)
Add this file to the NLU data in nlu.md.
It raises the chance of recognising more than 18k Indian names. (Don't blame me and ask what the point of deep learning is then. Indian names originate from native languages, and that is what makes them hard if you want to develop a deep learning model for this.)
## lookup:person_name
data/names.txt
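The lookup file is plain text with one entry per line. Below is a small sketch for cleaning a raw name list before saving it as `data/names.txt`. The function names `clean_lookup_entries` and `write_lookup_file` are hypothetical, and the case-insensitive de-duplication is my own choice, not something the original project did.

```python
from typing import Iterable, List


def clean_lookup_entries(raw_names: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blanks, and remove case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for name in raw_names:
        name = name.strip()
        key = name.lower()
        if not name or key in seen:
            continue
        seen.add(key)
        cleaned.append(name)
    return cleaned


def write_lookup_file(names: Iterable[str], path: str = "data/names.txt") -> int:
    """Write one name per line and return how many entries were written."""
    entries = clean_lookup_entries(names)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(entries) + "\n")
    return len(entries)
```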
OK, so why should we still use the spaCy entity extractor? Note that the setup above works only for Indian names. To handle international names as well, we should also use SpacyEntityExtractor, since spaCy is good at English names.
But you will then face a problem with the action server not extracting names. The issue is not with the extractors but with the actions.py code.
As I mentioned earlier, we are using two entity extractors, spaCy and CRF. If both of them extract an entity, the form action receives the values as a list, and validating the entity then becomes a problem. Here is my solution, which also validates the slots other than the name slot:
import re
from typing import Any, Dict, Text

from rasa_sdk import Tracker
from rasa_sdk.executor import CollectingDispatcher
from rasa_sdk.forms import FormAction


class loanForm(FormAction):
    def name(self):
        return "loan_form"

    @staticmethod
    def required_slots(tracker):
        return [
            "person_name",
            "email_id",
            "type_loan",
            "phone_number",
        ]

    def validate_phone_number(
        self,
        value: Text,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        # The value is a single string, or a list when both extractors fire.
        li = [value] if isinstance(value, str) else (value or [])
        p2 = pattern()
        for candidate in li:
            if p2.search_phone_number(text=candidate):
                return {"phone_number": candidate}
        # No candidate matched, so reset the slot and ask again.
        dispatcher.utter_message(text="That's an incorrect format, please enter a valid format")
        return {"phone_number": None}

    def validate_person_name(
        self,
        value: Text,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        li = [value] if isinstance(value, str) else (value or [])
        p3 = pattern()
        for candidate in li:
            # Accept the first candidate that is not purely digits.
            if not p3.search_phone_number(r"^\d+$", candidate):
                return {"person_name": candidate}
        return {"person_name": None}

    def validate_email_id(
        self,
        value: Text,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        li = [value] if isinstance(value, str) else (value or [])
        p = pattern()
        for candidate in li:
            if p.email_id_search(slot_value=candidate):
                return {"email_id": candidate}
        dispatcher.utter_message(text="Please enter a valid email id format xyz@xyz.com")
        return {"email_id": None}

    def slot_mappings(self):
        return {
            "person_name": [
                self.from_entity(entity="person_name", intent="inform"),
                self.from_entity(entity="PERSON", intent="inform"),
                self.from_entity(entity="ORG", intent="inform"),
                # Just to be sure: if both entity extractors fail,
                # the raw text is the only thing left to fill the slot with.
                self.from_text(intent="inform"),
                self.from_text(intent="greet"),
            ]
        }


# Regex helpers: validate the phone number and email, and check that a name
# does not consist purely of numbers.
class pattern:
    def search_phone_number(
        self,
        pattern=r"^(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$",
        text=None,
    ):
        # True if the text matches the given pattern (phone format by default).
        return re.compile(pattern).search(text) is not None

    def email_id_search(
        self,
        pattern=r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$",
        slot_value=None,
    ):
        # True if the slot value looks like an email address.
        return re.compile(pattern).search(slot_value) is not None
So this is my code for validating the entities. You cannot treat the value as a simple string, because it comes back as a list if both extractors recognise the entity.
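The list handling repeated in each validator can also be factored out. This is a sketch under the assumption that a slot value arrives either as a single string or as a list of strings. `as_list`, `first_valid` and `is_plausible_name` are hypothetical helper names, not part of rasa_sdk.

```python
import re
from typing import Callable, List, Optional, Union

SlotValue = Union[str, List[str], None]


def as_list(value: SlotValue) -> List[str]:
    """Wrap a single string in a list, pass lists through, treat None as empty."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def first_valid(value: SlotValue, is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that passes is_valid, or None if none does."""
    for candidate in as_list(value):
        if is_valid(candidate):
            return candidate
    return None


def is_plausible_name(text: str) -> bool:
    """Reject empty strings and strings that consist purely of digits."""
    stripped = text.strip()
    return bool(stripped) and re.fullmatch(r"\d+", stripped) is None


if __name__ == "__main__":
    # Two extractors returned different candidates for the same slot.
    print(first_valid(["12345", "ramesh"], is_plausible_name))  # prints: ramesh
```

Inside a validator you would then return `{"person_name": first_valid(value, is_plausible_name)}`, and a `None` result makes the form ask for the slot again.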
You may notice that I have also mapped the greet intent to the person_name slot. Why? My observation was that a South Korean name such as "Ching - haee" gets classified as greet with a confidence of 0.999. To avoid losing such names I added greet to the person_name slot mapping. This is purely optional: it also takes a plain "hi" and fills it into the slot, so you can remove it.
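If you keep the greet mapping, one way to stop plain greetings from landing in the slot is to filter them out before accepting the text as a name. This is only a sketch: the `GREETINGS` set and the function `name_from_text` are hypothetical and would need tuning for your own users.

```python
from typing import Optional

# Hypothetical list of greeting phrases; extend it for your own users.
GREETINGS = {"hi", "hello", "hey", "hii", "namaste", "good morning", "good evening"}


def name_from_text(text: str) -> Optional[str]:
    """Return the text as a name candidate unless it is empty or just a greeting."""
    cleaned = text.strip().strip("!.,?").strip()
    if not cleaned or cleaned.lower() in GREETINGS:
        return None
    return cleaned
```

A `None` result can be returned from the name validator as `{"person_name": None}`, so the form asks for the name again instead of storing "hi".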
The above method basically worked for me, and I hope it works with all Indian names. You can test the bot over here