Indian name recognition (named entity recognition for regional names): what works best for recognising names

Hey @community @tyd @stephens, I was looking for a good, easy way to do name recognition that works for both Indian names and international names.

Existing pipelines

Many existing pipelines are not very good at extracting name entities. spaCy works for US-origin or English names, but people with Indian names type them as transliterated from their native language into English, and there all of these pipelines fail to pick up the name entity.

So my solution for getting the bot to recognise Indian names is to combine a custom-trained CRFEntityExtractor with SpacyEntityExtractor. The config uses both entity extractors together:

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: "en"

pipeline:
  - name: SpacyNLP
    model: "en_core_web_lg"
    case_sensitive: False
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CRFEntityExtractor
  - name: SpacyEntityExtractor
    dimensions: ["PERSON", "ORG"]
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 25
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 25

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 25
  - name: MappingPolicy
  - name: FormPolicy
  - name: "FallbackPolicy"
    nlu_threshold: 0.4
    core_threshold: 0.3
    fallback_action_name: "action_default_fallback"

Above is the config file I used in my project. You will still run into problems recognising names, though, because neither CRF nor spaCy is built to pick up Indian names out of the box.

For that you need a good dataset for Indian name entity recognition. This is the nlu.md file I used for my project:

intent:inform

You may wonder why I have given a dataset in which every example has the same intent and the same shape:

  • my name is -----------------

If you look closely, the names in the dataset follow alphabetical order, starting from A and going to Z.

This is the training data for the CRFEntityExtractor, which acts as the custom entity extractor. Trained this way, it extracts Indian names that follow the string "my name is …", and so on for other phrasings.

To cover the case where the user enters only their name:

Add training data to the NLU file in the same way, using random Indian names on their own, in alphabetical order, covering A to Z.
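Writing hundreds of such lines by hand is tedious, so here is a minimal sketch of how they could be generated from a plain list of names. It assumes the Rasa 1.x Markdown training format and the `person_name` entity used in this thread; the function name `build_name_examples`, the two templates and the sample names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical sentence templates; "{}" is replaced by the annotated name.
TEMPLATES = ["my name is {}", "{}"]


def build_name_examples(names: Iterable[str], per_letter: int = 2,
                        entity: str = "person_name") -> List[str]:
    """Return Markdown example lines so that every starting letter found
    in `names` contributes up to `per_letter` names."""
    by_letter: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        name = name.strip()
        if name:
            by_letter[name[0].upper()].append(name)

    lines = []
    for letter in sorted(by_letter):  # A to Z
        for name in by_letter[letter][:per_letter]:
            for template in TEMPLATES:
                lines.append("- " + template.format(f"[{name}]({entity})"))
    return lines


if __name__ == "__main__":
    sample = ["Ankit", "Bhavna", "Chetan", "Zoya", "Aarav"]
    print("## intent:inform")
    print("\n".join(build_name_examples(sample, per_letter=1)))
```

The output can be pasted under the intent in nlu.md. Raising `per_letter` gives the extractor more variety per starting letter.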

Still not sure about your bot? To boost your confidence, add a lookup table backed by this dataset: names.txt (463.4 KB)

Reference this file from nlu.md as a lookup table.

This raises the chance of matching more than 18k Indian names. (Don't ask what the point of deep learning is then. Indian names originate from native languages, and that is the hard part if you want to develop a deep learning model.)

lookup:person_name

 - data/names.txt
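As far as I understand it, a lookup table ends up as a regex feature for the extractor, so it is a strong hint rather than a guaranteed match. The sketch below is only a rough stand-alone approximation of that idea in plain Python, not Rasa's actual implementation; `build_lookup_pattern` and `find_names` are hypothetical helper names.

```python
import re
from typing import Iterable, List


def build_lookup_pattern(names: Iterable[str]) -> "re.Pattern":
    """Compile one case-insensitive, whole-word regex from a list of names."""
    cleaned = [re.escape(n.strip()) for n in names if n.strip()]
    if not cleaned:
        raise ValueError("the lookup list is empty")
    # Longest first, so "Ankit Kumar" wins over "Ankit" when both are listed.
    cleaned.sort(key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(cleaned) + r")\b", re.IGNORECASE)


def find_names(text: str, pattern: "re.Pattern") -> List[str]:
    """Return every substring of `text` that matches a listed name."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    lookup = build_lookup_pattern(["Ankit", "Ankit Kumar", "Neha"])
    print(find_names("my name is ankit kumar", lookup))
```

A quick check like this against names.txt can tell you whether a given spelling is in the list at all before you start debugging the pipeline.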

OK, cool, then why should we use the spaCy entity extractor as well? Note that the setup above only works for Indian names. To cover international names too, spaCy is good at English names, so we should also use SpacyEntityExtractor.

But you will then face a problem with the action server not extracting names. The issue is not with the extractors; it is with the actions.py code.

As I mentioned earlier, we are using two entity extractors, spaCy and CRF. If both of them extract the entity, the form action receives the values as a list, and you will have trouble validating the entity. Here is my solution, which also applies if you want to validate slots other than the name slot:

import re
from typing import Any, Dict, List, Text, Union

from rasa_sdk import Tracker
from rasa_sdk.executor import CollectingDispatcher
from rasa_sdk.forms import FormAction


def as_list(value: Union[Text, List[Text], None]) -> List[Text]:
    # Two extractors can fill one slot, so the value may be a string or a list.
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


class loanForm(FormAction):

    def name(self):
        return "loan_form"

    @staticmethod
    def required_slots(tracker):
        return [
            "person_name",
            "email_id",
            "type_loan",
            "phone_number",
        ]

    def validate_phone_number(
        self,
        value: Text,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        p2 = pattern()
        # Accept the first candidate that looks like a phone number.
        for candidate in as_list(value):
            if p2.search_phone_number(string=candidate):
                return {"phone_number": candidate}
        dispatcher.utter_message(text="That's an incorrect format, please enter a valid format")
        return {"phone_number": None}

    def validate_person_name(
        self,
        value: Text,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        p3 = pattern()
        # Accept the first candidate that is not made up purely of digits.
        for candidate in as_list(value):
            if not p3.search_phone_number(r"^\d+$", candidate):
                return {"person_name": candidate}
        return {"person_name": None}

    def validate_email_id(
        self,
        value: Text,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        p = pattern()
        # Accept the first candidate that looks like an email address.
        for candidate in as_list(value):
            if p.email_id_search(slot_value=candidate):
                return {"email_id": candidate}
        dispatcher.utter_message(text="Please enter a valid email id format: xyz@xyz.com")
        return {"email_id": None}

    def slot_mappings(self):
        return {
            "person_name": [
                self.from_entity(entity="person_name", intent="inform"),
                self.from_entity(entity="PERSON", intent="inform"),
                self.from_entity(entity="ORG", intent="inform"),
                # Fallback: if both extractors fail, the raw text is all we have left.
                self.from_text(intent="inform"),
                self.from_text(intent="greet"),
            ]
        }


# Regex helpers: validate phone numbers and email ids, and check
# that a name does not consist purely of digits.
class pattern:

    def search_phone_number(
        self,
        pattern=r"^(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$",
        string=None,
    ):
        return re.compile(pattern).search(string) is not None

    def email_id_search(
        self,
        pattern=r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$",
        slot_value=None,
    ):
        return re.compile(pattern).search(slot_value) is not None

So this is my code for validating entities. You cannot treat the slot value as a simple string, because it comes back as a list if both extractors recognise the entity.
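The string-or-list handling can be tried on its own, without an action server. This is a minimal sketch in plain Python; `as_list` and `first_valid` are illustrative helper names, not part of rasa_sdk, and the sample values are invented.

```python
from typing import Callable, List, Optional, Union


def as_list(value: Union[str, List[str], None]) -> List[str]:
    """Normalise a slot value to a list, dropping duplicates but keeping order."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(dict.fromkeys(value))


def first_valid(value: Union[str, List[str], None],
                is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that passes `is_valid`, or None."""
    for candidate in as_list(value):
        if is_valid(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Both extractors fired and one of them picked up a number by mistake.
    print(first_valid(["12345", "Ravi"], lambda s: not s.isdigit()))
```

Checking every candidate, instead of stopping at the first one, matters when one extractor returns junk and the other returns the real value.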

You may notice that I also map the greet intent to the person_name slot. Why? I observed that many South Korean names look like "Ching - haee", and the bot classifies such input as greet with a confidence of 0.999. To avoid losing the name, I added greet to the person_name slot mapping. This is purely optional: it also takes a plain "hi" and fills it into the slot, so you can remove it.

The above method basically worked for me, and I hope it works for all Indian names. You can test the bot here:
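If you keep the greet mapping, one way to stop "hi" from landing in the slot is to filter obvious greetings in the name validator. This is only a sketch under my own assumptions: the `GREETING_WORDS` set and the `looks_like_name` helper are hypothetical, and the word list would need tuning for your users.

```python
import re

# Hypothetical list of greetings that should never be stored as a name.
GREETING_WORDS = {"hi", "hello", "hey", "hai", "namaste"}

# Letters, optionally joined by spaces, dots, apostrophes or hyphens.
NAME_RE = re.compile(r"[^\W\d_]+(?:[ .'\-]+[^\W\d_]+)*\.?")


def looks_like_name(text: str) -> bool:
    """Return True if `text` is name-shaped and is not a plain greeting."""
    cleaned = text.strip().lower()
    if not cleaned or cleaned in GREETING_WORDS:
        return False
    return NAME_RE.fullmatch(cleaned) is not None


if __name__ == "__main__":
    for text in ["Ching - haee", "hi", "12345"]:
        print(repr(text), looks_like_name(text))
```

A check like this could be called from validate_person_name so that the slot is only filled when it returns True.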

http://bigdatamatica.tk

Hi @sheggam_harshith, nice solution there. I am trying to use the lookup table with the names file you provided, but it is not being recognised.

## intent:name_entry

- [Benjamin](name)

- [Peter](name)

- [Jitendra](name)

- [Sam](name)

- [Ankit](name)

- [John](name)

- [Jiterder Sagar](name)

- [R Benjamin Franklin](name)

- [Ankit Kumar Mishra](name)

- [ankit](name)

- [Daniel](name)

- [Amit](name)

- [Rohit](name)

- my name is [sam](name)

- i am [roshan](name)

- i am [jhonny](name)

- myself [boris johnson](name)

- i call myself [rahul](name)

- I am [Suraj](name)

- this is [neha](name)

- [kajal](name) here

- This is [bigan mehto](name)

- [kiren](name) here

- [seema](name) here

- my name is [farheen](name)

## lookup:name

    data\names.txt

Can you tell me what the reason could be?

Can you post the pipeline you used in your project, along with the NLU and Core output?

sure @sheggam.

Configuration for Rasa NLU.

Components

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

Configuration for Rasa Core.

Policies

policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy

The output comes out as "Hello None!" (it should be "Hello name_of_person").

Configuration for Rasa NLU.

Components

language: "en"

pipeline:
  - name: SpacyNLP
    model: "en_core_web_lg"
    case_sensitive: False
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
    "pooling": "mean"
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: DucklingHTTPExtractor
    url: http://localhost:8000
    dimensions:
      - email
  - name: SpacyEntityExtractor
    dimensions: ["PERSON", "MONEY", "ORG"]
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 25
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 25

policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 25
  - name: MappingPolicy
  - name: FormPolicy
  - name: "FallbackPolicy"
    nlu_threshold: 0.4
    core_threshold: 0.3
    fallback_action_name: "action_default_fallback"

Hi @sheggam_harshith. Great solution. Thanks.

Any reason you used CRFEntityExtractor + Spacy instead of DIETClassifier + Spacy? Which one is a better custom entity extractor for Indian names, CRF or DIET?

Could you please share screenshots of your NLU data and domain?

Hi @RBenjaminfranklin, I am also facing the same issue. If you have resolved it, please help me.