How to use regex patterns for entity recognition?

I am trying to use regex patterns in my training data to avoid hardcoding all possible entity values. My nlu.md file looks like this:

## regex:code
- ^[YEM][YPM]\d{2,3}$

## intent:course_code
- [YP12](code)
- [MY01](code)
- [EP11](code)

What I want to do is to recognize patterns with two characters and up to three digits from user input and identify them as code entities. When I train my bot and give this a try, it only recognizes the examples I’ve given to it under intent:course_code and fails when some new pattern is given. Here is my config.yml file:

pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor
    "features": [
      [
        "pattern",
      ],
    ]
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I could just hardcode all the different values, but I don’t believe this is a very good solution. Can someone please help me with this?
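
For reference, the pattern from the question can be tried outside Rasa with Python's `re` module. This is only a sketch under an assumption: that the featurizer searches the pattern against the whole message text rather than against each token. If that holds, the `^` and `$` anchors would only ever match a message consisting of nothing but the code. The `BOUNDED` variant and the sample messages are made up for illustration.

```python
import re

# Pattern exactly as written in the nlu.md above: anchored to the whole string.
ANCHORED = re.compile(r"^[YEM][YPM]\d{2,3}$")
# Hypothetical alternative (not from the thread): word boundaries instead of ^ and $.
BOUNDED = re.compile(r"\b[YEM][YPM]\d{2,3}\b")


def find_codes(pattern, text):
    """Return every substring of `text` that the compiled pattern matches."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    # Made-up sample messages: a bare code, a code inside a sentence, two codes.
    for message in ["YP12", "tell me about YP12", "EM123 and MY01"]:
        print(message, find_codes(ANCHORED, message), find_codes(BOUNDED, message))
```

Note that `\d{2,3}` means two or three digits, so a value such as `YP1` is rejected by both variants.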

Hi @stavr ,

I think your config file and regex are fine. The number of examples you gave for the course_code intent is very low; you need at least 10-15 examples for it to be effective.

Please check this video.

You can also refer to the documentation.

(Also, I'm not sure if I'm right on this, but try moving the ## regex:code section after the ## intent:course_code section.)

I don’t really know what your use case is, but here are some things that may be helpful for you :thinking:

  • If there is a fixed list of courses (course codes) to choose from, you can try lookup tables.

  • To extract entities like dates, amounts of money, and distances, you can also try DucklingHTTPExtractor.

PS. Also make sure you’ve added the entities and intents to the domain.yml file.

Hello @_sanjay_r ,

Thanks a lot for replying and giving such a detailed answer! I found that the problem was a lot simpler than expected. While I was reading this article I noticed this little detail:

make sure RegexFeaturizer is in your nlp pipeline and present before CRFEntityExtractor

I checked my config.yml file and it wasn’t, so I made the changes and it is working perfectly now. I made sure to provide a few more examples too, as you mentioned :+1:t2:
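
The ordering requirement can be written down as a small check. This is a sketch in plain Python and not part of Rasa: the `comes_before` helper is hypothetical, and both pipelines are trimmed to the three components that matter here.

```python
def comes_before(pipeline, first, second):
    """Return True if component `first` is listed earlier in the pipeline than `second`."""
    names = [component["name"] for component in pipeline]
    if first not in names or second not in names:
        return False
    return names.index(first) < names.index(second)


# Order from the original config.yml: the extractor is listed before the featurizer.
original = [
    {"name": "WhitespaceTokenizer"},
    {"name": "CRFEntityExtractor"},
    {"name": "RegexFeaturizer"},
]

# Order after the fix: the featurizer runs first, so its pattern features
# already exist when the extractor looks at the tokens.
fixed = [
    {"name": "WhitespaceTokenizer"},
    {"name": "RegexFeaturizer"},
    {"name": "CRFEntityExtractor"},
]

if __name__ == "__main__":
    print(comes_before(original, "RegexFeaturizer", "CRFEntityExtractor"))  # False
    print(comes_before(fixed, "RegexFeaturizer", "CRFEntityExtractor"))     # True
```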

I’m actually using a lookup table which has values of professor names for a different intent. This is how this looks:

## intent:professor_email
- what's the email of mr [Johnson](professor_names);
- email of mr [Smith](professor_names)
- can you tell me mrs [Jones](professor_names) email

My lookup table is in a separate professor_names.txt file, and in my domain.yml I’ve defined a professor_names entity and slot. The entity extraction works perfectly when I provide professor names that appear in the training data, but when I enter a new one it’s not accurate at all.

Do you believe this is a problem with the number of training examples? :thinking:

I was afraid that by providing many examples the bot would ignore the lookup table, as mentioned here in the docs.

Any help would be highly appreciated :smiley:

Hey @stavr,

(Editing this: I was wrong in my initial reply about the whole concept of lookup tables. I apologize.)

Extracting names is a very difficult task, especially because names vary widely and follow no fixed pattern. Even I am struggling with this. But do have a look at CRFEntityExtractor.

Other than that, run Rasa with the --debug flag, then check whether the entities are getting extracted. If not, I guess you’ll have to add more examples from the lookup table to the training data. As it says in the documentation:

For lookup tables to be effective, there must be a few examples of matches in your training data. Otherwise the model will not learn to use the lookup table match features.
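
To make the quote above more concrete: a lookup table only tells the model that a token matched the table, and the extractor still has to learn from the training examples that such a match signals an entity. A rough sketch of the matching step alone, assuming the table is compiled into one case-insensitive regex. `lookup_to_regex`, `lookup_matches`, and the name `Papadopoulos` are made up for illustration, and this is not Rasa's actual code.

```python
import re


def lookup_to_regex(entries):
    """Build one case-insensitive pattern that matches any entry as a whole word."""
    escaped = [re.escape(entry.strip()) for entry in entries if entry.strip()]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


# Stand-in for the contents of professor_names.txt (one name per line).
# "Papadopoulos" is an invented entry that appears in no training example.
names = ["Johnson", "Smith", "Jones", "Papadopoulos"]
pattern = lookup_to_regex(names)


def lookup_matches(text):
    """Return the substrings of `text` that match an entry of the lookup table."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    print(lookup_matches("what's the email of mr Johnson"))
    print(lookup_matches("email of mr papadopoulos"))
    print(lookup_matches("email of mr Brown"))
```

The match itself works for a name that never appears in the training data; whether the extractor then labels it as `professor_names` depends on how well it learned to rely on that match feature.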

Also (and I’m not sure about this), tokenization may be case sensitive, so maybe entity extraction fails because of that. :thinking: To work around it, if you are using WhitespaceTokenizer, try adding "case_sensitive": False.

Do let me know if you find a good solution somewhere.

Cheers.

It appears the regex section can come before the intent section, according to the documentation here: NLU Training Data.