How to use regex patterns for entity recognition?

stavr · April 15, 2020, 3:49pm

I am trying to use regex patterns in my training data to avoid hardcoding all possible entity values. My nlu.md file looks like this:

## regex:code
- ^[YEM][YPM]\d{2,3}$

## intent:course_code
- [YP12](code)
- [MY01](code)
- [EP11](code)

What I want to do is to recognize patterns with two characters and up to three digits from user input and identify them as code entities. When I train my bot and give this a try, it only recognizes the examples I’ve given to it under intent:course_code and fails when some new pattern is given. Here is my config.yml file:

pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor
    "features": [
    [
      "pattern",
    ],
  ]
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I could just hardcode all the different values, but I don’t believe this is a very good solution. Can someone please help me with this?
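
For reference, the pattern from the question can be checked outside Rasa with Python's re module. This is only a sketch of what the expression itself accepts. The names is_code, CODE_PATTERN and UNANCHORED are made up for illustration, and how Rasa applies the pattern internally is not shown in this thread. One thing it makes visible: if a pattern is searched against a whole message rather than a single token (an assumption here), the ^ and $ anchors stop it from matching a code that sits inside a longer sentence.

```python
import re

# Pattern copied from the regex:code section of the question.
CODE_PATTERN = re.compile(r"^[YEM][YPM]\d{2,3}$")

# Hypothetical variant without anchors, using word boundaries instead.
UNANCHORED = re.compile(r"\b[YEM][YPM]\d{2,3}\b")


def is_code(text: str) -> bool:
    """Return True if the whole string matches the anchored course-code pattern."""
    return CODE_PATTERN.search(text) is not None


if __name__ == "__main__":
    # The three training examples from the question.
    for sample in ["YP12", "MY01", "EP11"]:
        print(sample, is_code(sample))

    # \d{2,3} requires two or three digits, so a single digit is rejected.
    print("YP1", is_code("YP1"))

    # Anchored: a code embedded in a sentence is not matched.
    print(is_code("my code is YP12"))

    # Unanchored: the same code is found inside the sentence.
    print(UNANCHORED.findall("my code is YP12"))
```

Note that \d{2,3} means two or three digits, which is slightly narrower than "up to three digits".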

_sanjay_r · April 15, 2020, 9:22pm

Hi @stavr ,

I think your config file and regex are fine. The number of examples you gave for the course_code intent is very low; you need to give at least 10-15 examples for it to be effective.

Please check this video.

You can also refer to the documentation.

(Also, I’m not sure if I’m right on this, but try moving the ## regex:code section after the ## intent:course_code section.)

I don’t really know what your use case is, but here are some things that may be helpful for you:

If there is a list of courses (course codes) to choose from, you could try lookup tables.
To extract entities like dates, amounts of money, or distances, you can also try DucklingHTTPExtractor.

PS. Also make sure you’ve added the entities and intents to the domain.yml file.

stavr · April 16, 2020, 8:13am

Hello @_sanjay_r ,

Thanks a lot for replying and giving such a detailed answer! I found that the problem was a lot simpler than expected. While I was reading this article I noticed this little detail:

make sure RegexFeaturizer is in your nlp pipeline and present before CRFEntityExtractor

I checked my config.yml file and it wasn’t, so I moved RegexFeaturizer before CRFEntityExtractor and it is working perfectly now. I made sure to provide a few more examples too, as you mentioned.
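
The ordering rule quoted above can be illustrated with a small check. This is a sketch only: the pipeline list mirrors the component names from the config.yml in the question, written as Python data so that it runs without a YAML library. The reordered version is one arrangement that satisfies the rule, not necessarily the exact file that ended up working. The helper comes_before is hypothetical and not part of Rasa.

```python
from typing import Dict, List

# Order as posted in the question: CRFEntityExtractor precedes RegexFeaturizer.
original_pipeline: List[Dict] = [
    {"name": "WhitespaceTokenizer"},
    {"name": "CRFEntityExtractor", "features": [["pattern"]]},
    {"name": "RegexFeaturizer"},
    {"name": "LexicalSyntacticFeaturizer"},
    {"name": "CountVectorsFeaturizer"},
    {"name": "DIETClassifier", "epochs": 100},
]

# Illustrative reordering: RegexFeaturizer now runs before CRFEntityExtractor.
reordered_pipeline: List[Dict] = [
    {"name": "WhitespaceTokenizer"},
    {"name": "RegexFeaturizer"},
    {"name": "CRFEntityExtractor", "features": [["pattern"]]},
    {"name": "LexicalSyntacticFeaturizer"},
    {"name": "CountVectorsFeaturizer"},
    {"name": "DIETClassifier", "epochs": 100},
]


def comes_before(pipeline: List[Dict], first: str, second: str) -> bool:
    """Return True if both components exist and `first` is listed before `second`."""
    names = [component["name"] for component in pipeline]
    if first not in names or second not in names:
        return False
    return names.index(first) < names.index(second)


if __name__ == "__main__":
    print(comes_before(original_pipeline, "RegexFeaturizer", "CRFEntityExtractor"))
    print(comes_before(reordered_pipeline, "RegexFeaturizer", "CRFEntityExtractor"))
```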

I’m actually using a lookup table which has values of professor names for a different intent. This is how this looks:

## intent:professor_email
- what's the email of mr [Johnson](professor_names);
- email of mr [Smith](professor_names)
- can you tell me mrs [Jones](professor_names) email

My lookup table is in a separate professor_names.txt file, and in my domain.yml I’ve defined a professor_names entity and slot. The entity extraction works perfectly when I provide professor names that appear in the training data, but when I enter a new one it’s not accurate at all.

Do you believe this is a problem with the number of training examples?

I was afraid that by providing many examples the bot would ignore the lookup table, as mentioned here in the docs.

Any help would be highly appreciated.

_sanjay_r · April 16, 2020, 7:19pm

Hey @stavr,

(Editing this, I was wrong on my initial reply about the whole concept of lookup tables. I apologize. )

Extracting names is a very difficult task, especially since names vary widely and follow no fixed pattern. I am struggling with this myself. But do have a look at CRFEntityExtractor.

Other than that, run rasa with the --debug flag and check whether entities are being extracted. If not, I guess you’ll have to add more examples from the lookup table to the training data. As the documentation says:

For lookup tables to be effective, there must be a few examples of matches in your training data. Otherwise the model will not learn to use the lookup table match features.
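
To make the quoted point concrete, here is a rough conceptual sketch of a lookup table acting as a per-token match flag. It is not Rasa's implementation. lookup_pattern and match_flags are hypothetical helpers, and the list stands in for professor_names.txt, whose real contents are not shown in this thread ("Papadopoulos" is invented for illustration). The flag is only a hint: a model still needs training examples in which flagged tokens are labelled as the entity before it learns to trust the hint.

```python
import re
from typing import List, Pattern, Tuple


def lookup_pattern(entries: List[str]) -> Pattern[str]:
    """Build one case-insensitive, word-bounded alternation from lookup entries."""
    escaped = [re.escape(entry.strip()) for entry in entries if entry.strip()]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def match_flags(sentence: str, pattern: Pattern[str]) -> List[Tuple[str, bool]]:
    """Return (token, matched) pairs for a whitespace-split sentence."""
    return [(token, pattern.fullmatch(token) is not None) for token in sentence.split()]


# Stand-in for professor_names.txt (hypothetical contents).
names = ["Johnson", "Smith", "Jones", "Papadopoulos"]
pattern = lookup_pattern(names)

if __name__ == "__main__":
    # A name that never appears in the training examples still gets flagged.
    print(match_flags("email of mr papadopoulos", pattern))
```

Because the sketch compiles the pattern with re.IGNORECASE, a lowercase "papadopoulos" is still flagged; a case-sensitive match would miss it.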

Also (and I’m not sure about this), tokenization may be case sensitive, so entity extraction may be failing because of that. To work around it, if you are using WhitespaceTokenizer, try adding "case_sensitive": False.

Do let me know if you found an apt solution somewhere.

Cheers.

ltfschoen · December 4, 2022, 6:19am

It appears the regex section can come before the intent section, according to the NLU Training Data documentation.
