Family name extraction

Anyone please ?

Hi @forwitai. Are you using the spaCy component to extract names as PERSON entities? I am pretty sure the pre-trained model should extract first names and family names as separate entities (one for the first name and another for the family name). Are you dealing with English names here, or with names from other countries?

Hello @Juste, thank you for the reply! In fact, spaCy would have been really helpful if I were dealing only with English names, but I’m not. I’m dealing with Arabic, French and English names, and spaCy barely extracts some of them. So I’m using a combination of CRFEntityExtractor trained on some NLU data, spaCy and lookup tables. The results seem fine so far. The problem is with family names: you just can’t put them all in a lookup table, because a family name can be Arabic, French or English and can be one, two or even three words long.

I ask the user “What’s your first name?” and then “What’s your last name?”. I’m thinking about extracting the whole answer as the last name, but it isn’t a very smart approach, as the user might ask about something else or say whatever crosses his mind.

Can you help me with this ? any ideas ?

Thanks a lot !

Hi @forwitai,

my name is Vincent. I’m working on a library that contains NLU components that should make it easier for non-English users to make assistants with Rasa. The project is called rasa nlu examples and it can be found here; Rasa NLU Examples.

I don’t speak Arabic but I’d love to learn more about the problems that you’re encountering. My guess is that there might be something to improve at the embedding part of your project. Unfortunately Duckling doesn’t work for Arabic and I fear that spaCy support is also modest at the moment.

I can’t speak Arabic unfortunately, but I’m trying to add support to rasa nlu examples for tools that cover non-English languages like Arabic.

In particular I’ve added support for the following tools for Arabic;

If you can share a config.yml pipeline that you’ve tried locally and perhaps part of your nlu.yml file then I might be able to help you in more detail.

Also, I’ll be speaking at PyData Riyadh this Wednesday. My goal is to have a talk with folks to see if there are more tools that I can make available for Arabic. Feel free to join in to exchange ideas!

I’ve also been reminded that there are Huggingface models for Arabic that might be worth trying out. It’s been suggested that "kuisailab/albert-base-arabic" (github link) might be worth trying and I think that we support it natively via our huggingface component.

You might be able to use whatlies to play around with these embeddings beforehand.

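from whatlies.language import HFTransformersLanguage
from whatlies.transformers import Pca

# Load the Arabic ALBERT model as a whatlies language backend
hf_lang = HFTransformersLanguage("kuisailab/albert-base-arabic")

# Arabic words: cat, the dog, the mouse, man, the women, king, queen
text = [
    "قط","الكلب","الفأر","رجل","النساء","ملك","ملكة"
]

# Embed the words, reduce to 2 dimensions with PCA, and plot interactively
hf_lang[text].transform(Pca(2)).plot_interactive(annot=True)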

If you replace text with words and names, you should see that the names appear in a different cluster.

Hi @koaning, thank you so much for your answer! Actually I’m not dealing with Arabic letters, but rather with Arabic names written in Latin letters, for example “Maryam”, “Alaa”, “Samir”, etc. So I have those names plus the French/English names.

My chatbot is trained in French. I’ve managed to extract first names with a combination of CRFEntityExtractor, spaCy and lookup tables. But for last names, it’s much harder! Last names can be composed of 1, 2 or more words, they are very different depending on the country, and you can’t just enumerate them in a lookup table. That’s where I’m stuck!

I’m wondering if you might be able to “apply a hack” then.

A first name is usually followed by a last name, no? You might be able to write a custom NLU component that looks at the current tokens/entities and appends where needed.
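To make the “hack” concrete, here is a minimal sketch in plain Python rather than an actual Rasa component. It assumes the first name has already been located at a known token position and simply treats the next few name-like tokens as the last name. The function name `guess_last_name` and the `max_words` limit are made up for illustration, and the approach breaks down if the user writes the last name first.

```python
from typing import List, Optional


def guess_last_name(
    tokens: List[str], first_name_index: int, max_words: int = 3
) -> Optional[str]:
    """Treat up to `max_words` name-like tokens after the first name as the last name."""
    following = tokens[first_name_index + 1 : first_name_index + 1 + max_words]
    name_tokens = []
    for tok in following:
        # Stop at the first token that is not purely alphabetic
        # (hyphens and apostrophes inside a name are tolerated).
        if not tok.replace("-", "").replace("'", "").isalpha():
            break
        name_tokens.append(tok)
    return " ".join(name_tokens) if name_tokens else None


print(guess_last_name(["Je", "m'appelle", "Samir", "El", "Kamali"], 2))
```

Inside a real pipeline the token list and the position of the first name would come from the tokenizer and the entity extractors that ran earlier.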

Got a small chunk of example data? Also, what spaCy model are you using? What exactly are you doing in your lookup tables?

Another thing to point out; is there a reason why you’re not using a form?

I’m actually using a form and trying to extract first and last names inside it! The bot first asks “What’s your first name?”, then “What’s your last name?”, and is supposed to extract both separately.

Concerning spaCy, I’m using “fr_core_news_sm”; it gives good results with French and English first names. I use a lookup table to extract Arabic names written in Latin letters. I’ve built a sort of database with around 3000 names.

Here’s an example of my nlu data for first names :

  • Je m’appelle Sarah
  • Mon prénom est Maria
  • Sophie
  • Ok, voici mon prénom : Camélia
  • Meriem
  • Je m’appelle Malika

And here’s an example of my nlu data for last names :

  • Smith
  • C’est Williams
  • Mon nom est Alaoui
  • Le voici : El Kamali
  • C’est Abadi
  • Saqqaf
  • mon nom est Al Andaloussi
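Since the form asks for the last name on its own, the “take the whole answer” idea from earlier in the thread could be made a little safer by stripping common lead-in phrases first. The sketch below is only an illustration: the phrase list and the function name `strip_lead_in` are assumptions based on the example utterances above, not part of Rasa, and it does nothing about users who reply with something that is not a name.

```python
import re

# Hypothetical lead-in phrases, taken from the example utterances above.
LEAD_INS = [
    r"mon nom est\s+",
    r"c['’]est\s+",
    r"le voici\s*:?\s*",
]
LEAD_IN_PATTERN = re.compile(
    r"^\s*(?:" + "|".join(LEAD_INS) + r")", flags=re.IGNORECASE
)


def strip_lead_in(answer: str) -> str:
    """Remove one known lead-in phrase from the start of the answer, if present."""
    return LEAD_IN_PATTERN.sub("", answer, count=1).strip()


for utterance in ["Smith", "C’est Williams", "Le voici : El Kamali"]:
    print(repr(strip_lead_in(utterance)))
```

Whatever remains after stripping can be one, two or three words, so multi-word surnames such as “El Kamali” survive intact.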

Regarding your idea of asking for the first and last name at the same time: the user might say his name is “Will Smith” or “Smith Will”, and his last name might be composed of 2 or 3 words, so I wouldn’t know how to locate the last name. This is especially true if the first letters are not capitalized; even POS tagging wouldn’t help.

I hope my case is clearer now to you, any advice is more than welcome !

There are some more ideas that come to mind. Have you tried fr_core_news_md and fr_core_news_lg? These models may have better performance because they are trained on more data. It also deserves mentioning that soon-ish spaCy 3.0 will be officially released and that Rasa will start to support the new French transformer models that are inside. These might also help out here.

Part of me is wondering if it might be simpler to just brute force it by adding a larger database. A quick google resulted in this website. We have a fancy new RegexEntityExtractor in Rasa 2.0 that might be of help here too.
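As a rough picture of what brute-forcing with a name list amounts to, the sketch below builds one case-insensitive regular expression from a list of surnames and searches a message with it. This is plain Python and not the actual RegexEntityExtractor implementation; the helper names `build_name_pattern` and `find_names` are hypothetical, and the surnames are the ones from the examples earlier in the thread.

```python
import re
from typing import List


def build_name_pattern(names: List[str]) -> re.Pattern:
    """Compile one alternation pattern that matches any name in the list."""
    # Longest names first, so a multi-word entry wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Word boundaries prevent "Smith" from matching inside "Smithson".
    return re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)


def find_names(text: str, pattern: re.Pattern) -> List[str]:
    """Return every substring of `text` that matches a known name."""
    return [match.group(0) for match in pattern.finditer(text)]


surnames = ["Smith", "Williams", "Alaoui", "El Kamali", "Al Andaloussi"]
pattern = build_name_pattern(surnames)
print(find_names("mon nom est al andaloussi", pattern))
```

Because matching ignores case, names typed without capital letters are still found, but anything missing from the list is not.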

I did try fr_core_news_md and it gives pretty much the same results as fr_core_news_sm in my case.

I think that would depend on the origin of the surnames, I’ll give it a shot !

Thanks a lot @koaning !

Pragmatically though, I might solve part of your issue by introducing a conversation design. You might be able to design stories like;

[bot] What is your name?
[person] Vincent D. Warmerdam
[bot] Just to check, is your name Vincent D.? If it isn't, please write your first and last name again.
[person] Vincent Warmerdam
[bot] Just to check, is your name Vincent Warmerdam? If it isn't, please write your first and last name again.
[person] This is correct
[bot] ... continues conversation...

Again, this is me thinking out loud. But something like this feels “safe”. I’m assuming that it is very important to get the name 100% right so it might be best to add an extra step. This way the user can also catch any spelling errors.

I’ve thought about confirming the first and last names too, but I was afraid it would be a little inconvenient for the user !

It’s certainly a fair concern, but at the same time, I also imagine that a misspelled name in a database can be an even larger concern.

You’re probably right, I will try to apply your advice.

Thank you so much Vincent, I really appreciate your help !

@forwitai this thread inspired me to maybe build a component to help you out. I’ll pitch you an idea and if you think it makes sense I’ll start working on it.

  1. I think it would be very helpful if I started hosting/open-sourcing name lists. What you’re describing here is a general problem but if we can host the top 50K names per country then I can imagine that we can at least provide a reasonable starting point. You can use a RegexEntityExtractor to fetch all of these names.
  2. Names can still be misspelled so maybe we need to have support for “Fuzzy Matching”. What I could do is I could make a variant of the RegexEntityExtractor that matches the names in the lookup table but allows for slight misspellings.

Would this be useful? Also, would you be able to collaborate by perhaps sharing your name-list?

I’ve created a github issue here so feel free to discuss there as well.

This sounds like an interesting idea for first names, I’m not so sure about last names though.

Concerning the misspellings, I’ve used a Python package to return the closest matches and thus allow any kind of misspellings (for the names that can be written in different ways, it’s the case for the majority of Arabic names).
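The package used is not named in this thread, so as an assumption the sketch below uses `difflib` from the Python standard library to show the general idea of returning the closest known name. The function name `closest_name` and the `cutoff` value of 0.8 are illustrative choices and would need tuning on real data.

```python
import difflib
from typing import List, Optional


def closest_name(
    candidate: str, known_names: List[str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the known name most similar to `candidate`, or None if nothing is close."""
    # Compare in lowercase, but return the name as it is spelled in the list.
    lowered = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        candidate.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


names = ["Maryam", "Meriem", "Samir"]
print(closest_name("Meryem", names))
```

A lower cutoff tolerates more spelling variants but also raises the chance of mapping an unrelated word onto a name.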

And sure, I’ll be more than glad to collaborate !

I’ve just added a small Arabic name list and larger lists for German and US names. These larger name lists also include some international names, but they won’t be perfect.

I’m also exploring tools like Faker to supply lists of names. For Arabic though, it uses the Arabic alphabet and there isn’t a locale available for the Roman one.

You can have a look here. If name lists for last names also exist, I’ll gladly add these too.

I haven’t worked on last names yet as I’m working on other projects also, but I’ll let you know if I end up creating a list !

For first names, I’ve already come across the Wikipedia list, but it was not really useful for me since it contains few to no names from my country. Where can I upload the list I’ve created?

You can create a PR for the rasa nlu examples repository. In particular, you can add the list to the appropriate folder in here.

Done !
