Email classification

Hi, I have been using Rasa NLU in my work recently. My task is to classify emails into two intents and extract a lot of entities, such as person names, account numbers, membership numbers, and phone numbers. I have tried many pipeline configurations, but I couldn't find one that succeeds at extracting these entities… A single email contains many entities to extract, but the results are not conclusive…

Has anyone worked on a similar problem who can give me some tips? Is there an order that the NLU uses to extract entities? Sometimes when I move a word before another, the NLU recognizes it, and sometimes it doesn't…

Thank you for helping me

What kind of pipeline are you currently using? Can you also share some examples of your training data? That would make it easier to understand what kind of entities you are trying to extract. Also, what language are you working with?

In general, we have had quite good experience with the DIETClassifier. Often it is already sufficient to have a pipeline similar to:

- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper

Hello, I am using the French language with this pipeline :slight_smile:

language: "fr"

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: LexicalSyntacticFeaturizer
- name: EntitySynonymMapper
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
  batch_size: 64
  entity_recognition: False

I want to recognize person names, account numbers, membership numbers, etc., but the NLU recognizes some of them only some of the time. The results are very random… I have implemented some regexes.

Thank you

Can you show some of the mistakes your model makes?

For example:

Test #1 (650 ms) for “Bonjour, je suis Madame Ludivine, numéro d’adhérent : 1\1256325, je vous met ci-joint mon rib vous pouvez-joindre au : 06 32 14 89 12”:

Intents found:

  • modif_RIB (confidence: 0.9982733726501465)
  • autre (confidence: 0.0017266019713133574)

Entities found:

  • civilite: “madame”
  • numero_adherent: “1\1256325”

Test #2 (1111 ms) for “Bonjour, numéro d’adhérent : 1\1256325, je vous met ci-joint mon rib vous pouvez-joindre au : 06 32 14 89 12 Madame Ludivine”:

Intents found:

  • modif_RIB (confidence: 0.9998962879180908)
  • autre (confidence: 1.0371021926403046E-4)

Entities found:

  • numero_adherent: “1\1256325”
  • civilite: “Madame Ludivine”

Another example, Test #9 (1021 ms) for “Nouvel identité bancaire, numéro : 1/125468954”:

Intents found:

  • modif_RIB (confidence: 1)
  • autre (confidence: 1.0737703044425007E-12)

Entities found:

  • numero_adherent: “1/125468954”

Test #10 (694 ms) for " numéro : 1/125468954 Nouvel identité bancaire":

Intents found:

  • modif_RIB (confidence: 1)
  • autre (confidence: 9.96091447172387E-13)

Entities found:

  • (none)

Civilite is “Madame” or “Monsieur”. But the NLU does not recognize the name Ludivine or the phone number; sometimes it recognizes one of them and sometimes not. Perhaps the order matters? If you can give me some tips to improve the entity extraction, that would be great.

This is a very simple example, but in many cases the email is much longer than this one.

How much training data do you have? E.g. how many examples per entity? It often helps just to add a couple more examples to the training data.
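For reference, entity annotations in the Rasa YAML training data format look like this. This is only a sketch built from the intent and entity names mentioned in this thread; the `nom` and `telephone` entity names and the exact example sentences are assumptions, so adapt them to your own data:

```yaml
nlu:
- intent: modif_RIB
  examples: |
    - Bonjour, je suis [Madame](civilite) [Ludivine](nom), numéro d'adhérent : [1/125468954](numero_adherent)
    - Nouvelle identité bancaire, numéro : [1/125468954](numero_adherent)
    - Je vous joins mon nouveau RIB, vous pouvez me joindre au [06 32 14 89 12](telephone)
```

Each entity the model should learn needs a reasonable number of annotated occurrences, ideally appearing at varying positions within the sentences, so the model does not latch onto word order.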

Also, if you want to extract numbers with a certain pattern, I recommend using either Duckling or the RegexEntityExtractor. That should help you extract entities that follow a certain pattern.
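As a sketch, the regexes used by the RegexEntityExtractor are defined in the training data, and each regex name must match an entity name. The patterns below are assumptions based on the formats in your test messages (membership numbers like `1/125468954` or `1\1256325`, French phone numbers like `06 32 14 89 12`), so adjust them to your real formats:

```yaml
nlu:
- regex: numero_adherent
  examples: |
    - \d[\\/]\d{6,9}
- regex: telephone
  examples: |
    - 0\d(?:[ .-]?\d{2}){4}
```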

Also, it might be a good idea to switch to the DIETClassifier for entity extraction instead of the CRFEntityExtractor, as it is usually a bit more powerful.

So maybe you can try the following config, using the DIETClassifier to extract civilite and the RegexEntityExtractor to extract the phone number, for example:

- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: RegexEntityExtractor
- name: EntitySynonymMapper

Thank you, I am working on it. It's already better. I am using this one now:

language: "fr"

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: RegexEntityExtractor
  use_regexes: True
- name: CRFEntityExtractor
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper

I have a question: does the order of components matter, e.g. RegexEntityExtractor before DIETClassifier? Is it better to put the RegexEntityExtractor after the DIETClassifier?