I am working on a chatbot that requires the user to enter many unique IDs, for example account numbers or complaint numbers such as:
xx0934345, xa237472834, xb3453
To extract these IDs I used regular expressions, and that works, but what I really don't understand is the role of examples. According to some forum posts the regex needs some training examples to work, but even with examples it does not work as expected.
Example
I have given a few examples like:
my account number is xa4538475
xa223423904 is my account number
complaint id xc23423
complaint request id is xc3242384
A model trained on these examples will only extract patterns like xa23423 or xc678678, but inputs like xk324234, xu234234 or xo234234 will not be extracted, since there are no similar examples in the NLU data. As per my regex, it should identify and extract all patterns that start with two letters followed by digits.
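To make the expectation concrete, here is a minimal sketch of the regex on its own, outside Rasa. The pattern and the names ID_PATTERN and find_ids are my assumptions for illustration (the original regex is not shown in the post); it only demonstrates that a plain regex does not care which two letters the ID starts with.

```python
import re

# Assumed pattern: exactly two letters followed by one or more digits,
# bounded by word boundaries so ordinary words are not picked up.
ID_PATTERN = re.compile(r"\b[a-zA-Z]{2}\d+\b")


def find_ids(text):
    """Return every substring of `text` that matches ID_PATTERN."""
    return ID_PATTERN.findall(text)


if __name__ == "__main__":
    for sentence in (
        "my account number is xa4538475",
        "complaint id xk324234",
        "xu234234 and xo234234 are my request ids",
    ):
        print(sentence, "->", find_ids(sentence))
```

Run standalone, this matches the xk, xu and xo prefixes just as it matches xa and xc, which is the behaviour I expected from the bot.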
Whenever you have the RegexFeaturizer in your NLU pipeline, Rasa looks for matching candidates of your defined regular expressions in the text. However, Rasa will not just extract them as entities, but will create features, e.g. whether a word matches the regex for account or not. Those features are added to the features used for the CRF (the model that extracts entities). Thus, you need to add some examples to the NLU data, so that the CRF can learn that those features are relevant to determine whether a word is an entity or not.
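As a rough illustration of the difference between a feature and an extracted entity, here is a toy sketch. It is not Rasa's actual implementation: the pattern, the whitespace tokenisation and the function name regex_features are all assumptions made for the example. The point is that the regex only yields a true/false flag per token, and a separate model still has to learn from training examples what that flag means.

```python
import re

# Assumed account pattern: two letters followed by digits.
ACCOUNT_RE = re.compile(r"[a-zA-Z]{2}\d+")


def regex_features(text):
    """Return one (token, matches_regex) pair per whitespace-separated token.

    The boolean is only an input feature for a downstream entity model;
    nothing is extracted as an entity at this stage.
    """
    return [(tok, bool(ACCOUNT_RE.fullmatch(tok))) for tok in text.split()]


if __name__ == "__main__":
    print(regex_features("my account number is xk324234"))
```

The flag is set for xk324234 even though that prefix never appears in the training data, so the regex feature itself generalises; whether the entity is then extracted depends on what the CRF learned from the other features and examples.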
So, maybe double check if the RegexFeaturizer is in your pipeline and try to add some more examples to the NLU data.
This just goes against the whole idea of using regular expressions.
I have included the RegexFeaturizer in the pipeline. It does capture regex matches, but only ones similar to the examples specified in the NLU training data.
Adding more examples to the training data helps, but it will make the training data huge. And in our case we cannot include every possible combination in the data, because the IDs may vary from time to time.
For example, right now they may be ab000123, ac234145, ad3456789, but later they may change to zw4567257, zt7531598, etc.
@surya7592 Maybe we can add a flag to the RegexFeaturizer that indicates whether to add the regex as a feature or to take the matches directly as entities. What do you think? Can you open an issue for that on GitHub?
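For reference, the "take the matches directly as entities" behaviour could look roughly like the sketch below. This is plain Python, not an existing Rasa flag or API: the pattern, the function name extract_id_entities, the default entity name and the dictionary layout are all hypothetical and only meant to show the idea.

```python
import re

# Assumed ID pattern: two letters followed by digits, on word boundaries.
ID_RE = re.compile(r"\b[a-zA-Z]{2}\d+\b")


def extract_id_entities(text, entity_name="account_number"):
    """Turn every regex match into an entity dict with character offsets.

    No training examples are involved: any ID that fits the pattern is
    returned, whatever its two-letter prefix is.
    """
    return [
        {
            "entity": entity_name,
            "value": match.group(),
            "start": match.start(),
            "end": match.end(),
        }
        for match in ID_RE.finditer(text)
    ]


if __name__ == "__main__":
    print(extract_id_entities("zw4567257 is my account number"))
```

With something like this, a change from ab... to zw... prefixes would need no new training examples, at the cost of extracting every match regardless of context.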