Need clarity on Rasa regex


I am working on a chatbot that requires the user to enter many unique IDs, for example account numbers or complaint numbers such as xx0934345, xa237472834, or xb3453.

To extract these IDs I used regular expressions, and that works, but what I really don’t understand is the role of training examples. According to some forum posts, the regex needs some examples to work, but that does not work as expected. For example, I have given a few examples like:

  • my account number is xa4538475
  • xa223423904 is my account number
  • complaint id xc23423
  • complaint request id is xc3242384

A model trained on these examples only extracts IDs like xa23423 or xc678678. IDs like xk324234, xu234234, or xo234234 are not extracted, since there are no similar examples in the NLU data. As per my regex, it should identify and extract every ID that starts with two letters followed by digits.

Is there any way to handle this? @akelad @dakshvar22 @Juste

Whenever you have the RegexFeaturizer in your NLU pipeline, Rasa looks for matches of your defined regular expressions in the text. However, Rasa will not extract those matches as entities directly; it creates features from them, e.g. whether or not a word matches the regex for account. Those features are added to the features used by the CRF (the model that extracts entities). Thus, you need to add some examples to the NLU data, so that the CRF can learn that those features are relevant for deciding whether a word is an entity or not.
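
To make the "regex as a feature" idea concrete, here is a rough sketch in plain Python. It is not Rasa's actual implementation: the pattern, the whitespace tokenization, and the function name `regex_features` are assumptions made up for illustration. It only shows that a regex match becomes a per-word flag, which a model such as the CRF then has to learn to use.

```python
import re

# Hypothetical pattern for the IDs discussed in this thread:
# two letters followed by one or more digits.
ACCOUNT_ID = re.compile(r"[a-z]{2}\d+", re.IGNORECASE)


def regex_features(text):
    """Return (token, flag) pairs; flag is 1 if the whole token matches the pattern."""
    # Naive whitespace tokenization, purely for illustration.
    return [
        (token, int(ACCOUNT_ID.fullmatch(token) is not None))
        for token in text.split()
    ]


features = regex_features("my account number is xk324234")
print(features)
```

The flag is set for xk324234 even though no training example uses the prefix xk. Whether the entity is actually extracted still depends on what the model has learned from the training data.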

So, maybe double check if the RegexFeaturizer is in your pipeline and try to add some more examples to the NLU data.


This just goes against the whole idea of using regular expressions.

I have included the RegexFeaturizer in the pipeline. It does capture regex matches, but only ones similar to the examples specified in the NLU training data. Adding more examples to the training data helps, but it makes the training data huge. And in our case we cannot include every possible combination in the data, because the IDs may change from time to time.

For example, right now the IDs may be ab000123, ac234145, ad3456789, but later they may change to zw4567257, zt7531598, etc.

How can we take care of such cases?

@surya7592 Maybe we can add a flag to the RegexFeaturizer that indicates whether to add the regex as a feature or to take the matches directly as entities. What do you think? Can you open an issue for that on GitHub?
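
For reference, taking regex matches directly as entities could look roughly like the sketch below. This is plain Python with the standard `re` module, not an existing Rasa flag or component. The pattern, the function name `extract_ids`, the entity name `account_number`, and the dictionary keys are all illustrative assumptions, and wiring it into a Rasa pipeline is not shown.

```python
import re

# Hypothetical pattern: a standalone word made of two letters followed by digits.
ID_PATTERN = re.compile(r"\b[a-z]{2}\d+\b", re.IGNORECASE)


def extract_ids(text, entity_name="account_number"):
    """Return every regex match as an entity dict, independent of any training data."""
    return [
        {
            "entity": entity_name,   # illustrative entity name
            "value": match.group(0),
            "start": match.start(),  # character offsets of the match in the text
            "end": match.end(),
        }
        for match in ID_PATTERN.finditer(text)
    ]


entities = extract_ids("zw4567257 is my account number")
print(entities)
```

Because nothing here is learned, a new prefix such as zw or zt is picked up without adding examples. The trade-off is that every string of that shape is extracted, whether or not it is really an ID in context.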