How to use regex patterns for entity recognition?

I am trying to use regex patterns in my training data to avoid hardcoding all possible entity values. My file looks like this:

## regex:code
- ^[YEM][YPM]\d{2,3}$

## intent:course_code
- [YP12](code)
- [MY01](code)
- [EP11](code)
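
As a quick sanity check outside Rasa, the pattern above can be tried with Python's `re` module. This is only a sketch: the helper name `is_course_code` and the sample strings are made up for illustration, and it says nothing about how Rasa itself applies the pattern.

```python
import re

# Pattern copied from the regex:code entry above.
CODE_PATTERN = re.compile(r"^[YEM][YPM]\d{2,3}$")


def is_course_code(text: str) -> bool:
    """Return True if the whole string looks like a course code, e.g. 'YP12'."""
    return CODE_PATTERN.search(text) is not None


# Hypothetical inputs: three codes from the training data, one unseen code,
# one non-matching code, and a full sentence containing a code.
samples = ["YP12", "MY01", "EP11", "EM123", "AB12", "my code is YP12"]
for sample in samples:
    print(f"{sample!r}: {is_course_code(sample)}")
```

Because of the `^` and `$` anchors, the pattern only matches when the entire string is the code, so the full sentence in the last sample does not match. Whether that matters in a bot depends on whether the pattern is applied to single tokens or to the whole message, which this sketch does not establish.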

What I want to do is to recognize patterns with two characters and up to three digits from user input and identify them as code entities. When I train my bot and give this a try, it only recognizes the examples I’ve given to it under intent:course_code and fails when some new pattern is given. Here is my config.yml file:

  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor
    "features": [
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I could just hardcode all the different values, but I don’t believe this is a very good solution. Can someone please help me with this?

Hi @stavr ,

I think your config file and regex are fine. The number of examples you gave for the course_code intent is too small; you need at least 10 to 15 examples for it to be effective.

Please check this Video.

You can also refer to the documentation.

(Also, I’m not sure if I’m right on this, but try moving the ## regex:code section after the ## intent:course_code section.)

I don’t really know what your use case is, but here are some things that may be helpful for you :thinking:

  • If there is a fixed list of course codes to choose from, you could try lookup tables.

  • To extract entities such as dates, amounts of money, and distances, you can also try DucklingHTTPExtractor.

PS. Also make sure you’ve added the entities and intents to the domain.yml file.

Hello @_sanjay_r ,

Thanks a lot for replying and giving such a detailed answer! I found that the problem was a lot simpler than expected. While I was reading this article I noticed this little detail:

make sure RegexFeaturizer is in your nlp pipeline and present before CRFEntityExtractor

I checked my config.yml file and it wasn’t, so I made the changes and it is working perfectly now. I made sure to provide a few more examples too, as you mentioned :+1:t2:
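
For anyone who wants to check the same thing in their own setup, here is a minimal sketch of the ordering check in plain Python. The list `original` copies the component names from the config.yml in the question; `fixed` is an assumption about what the corrected order looks like (the final file was not posted), and `comes_before` is a hypothetical helper, not part of Rasa.

```python
from typing import List

# Component order as posted in the question.
original: List[str] = [
    "WhitespaceTokenizer",
    "CRFEntityExtractor",
    "RegexFeaturizer",
    "LexicalSyntacticFeaturizer",
    "CountVectorsFeaturizer",
    "CountVectorsFeaturizer",
    "DIETClassifier",
    "EntitySynonymMapper",
    "ResponseSelector",
]

# Assumed corrected order: RegexFeaturizer moved ahead of CRFEntityExtractor,
# everything else left unchanged.
fixed: List[str] = [
    "WhitespaceTokenizer",
    "RegexFeaturizer",
    "CRFEntityExtractor",
    "LexicalSyntacticFeaturizer",
    "CountVectorsFeaturizer",
    "CountVectorsFeaturizer",
    "DIETClassifier",
    "EntitySynonymMapper",
    "ResponseSelector",
]


def comes_before(pipeline: List[str], first: str, second: str) -> bool:
    """Return True if both components are present and `first` precedes `second`."""
    if first not in pipeline or second not in pipeline:
        return False
    return pipeline.index(first) < pipeline.index(second)


print(comes_before(original, "RegexFeaturizer", "CRFEntityExtractor"))
print(comes_before(fixed, "RegexFeaturizer", "CRFEntityExtractor"))
```

The first call reports that the posted pipeline has the two components in the wrong order; the second reports that the assumed corrected pipeline has them in the right order.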

I’m actually using a lookup table which has values of professor names for a different intent. This is how this looks:

## intent:professor_email
- what's the email of mr [Johnson](professor_names);
- email of mr [Smith](professor_names)
- can you tell me mrs [Jones](professor_names) email

My lookup table is in a separate professor_names.txt file, and in my domain.yml I’ve defined a professor_names entity and slot. The entity extraction works perfectly when I provide professor names that are included in the training data, but when I enter a new one it’s not accurate at all.

Do you believe this is a problem with the number of training examples? :thinking:

I was afraid that by providing many examples the bot would ignore the lookup table, as mentioned here in the docs.

Any help would be highly appreciated :smiley:

Hey @stavr,

(Editing this: I was wrong in my initial reply about the whole concept of lookup tables. I apologize.)

Extracting names is a very difficult task, especially because names vary widely and follow no fixed pattern. Even I am struggling with this. But do have a look at CRFEntityExtractor.

Other than that, run Rasa with the --debug flag and check whether the entities are being extracted. If not, I guess you’ll have to add more examples from the lookup table to the training data. As the documentation says:

For lookup tables to be effective, there must be a few examples of matches in your training data. Otherwise the model will not learn to use the lookup table match features.
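
One way to picture why matching examples are needed: a lookup table can only supply an "is this word on the list?" signal, and the extractor still has to learn from the training data that this signal indicates an entity. The sketch below shows such a signal built as a single case-insensitive regex. It is an illustration under assumptions, not Rasa's implementation: `build_lookup_pattern`, `lookup_matches`, and the name "Papadopoulos" are hypothetical, and the list stands in for the contents of professor_names.txt.

```python
import re
from typing import List, Pattern


def build_lookup_pattern(entries: List[str]) -> Pattern[str]:
    """Combine lookup entries into one case-insensitive, word-bounded regex."""
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    # Escape each entry so characters like '.' are matched literally.
    alternatives = [r"\b" + re.escape(entry) + r"\b" for entry in cleaned]
    return re.compile("(" + "|".join(alternatives) + ")", re.IGNORECASE)


# Hypothetical stand-in for the lines of professor_names.txt.
professor_names = ["Johnson", "Smith", "Jones", "Papadopoulos"]
pattern = build_lookup_pattern(professor_names)


def lookup_matches(message: str) -> List[str]:
    """Return every substring of the message that matches a lookup entry."""
    return [match.group(0) for match in pattern.finditer(message)]


print(lookup_matches("email of mr smith"))
print(lookup_matches("can you tell me mrs Papadopoulos email"))
print(lookup_matches("email of mr Brown"))
```

A name that is on the list but absent from the training examples still produces a match here, while a name that is on neither produces none. The match is only a feature, though, so a model that has seen too few examples with the feature switched on may not learn to rely on it.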

Also (and I’m not sure about this), tokenization may be case sensitive, so entity extraction might be failing because of that. :thinking: If you are using WhitespaceTokenizer, try adding "case_sensitive": False to it.

Do let me know if you found an apt solution somewhere.


It appears the regex section can come before the intent section, according to the documentation here: NLU Training Data.