Regex-based entity extraction

I want to extract entities of two types. One is the ‘words’ entity, which accepts alphanumeric strings (including hyphens and underscores). The regex for the ‘words’ entity is [a-zA-Z0-9_\-]*.

The other is the ‘multiWords’ entity, which accepts words and sentences enclosed in double quotes. The regex for the ‘multiWords’ entity is "[a-zA-Z0-9_\-][a-zA-Z0-9_\- ]*[a-zA-Z0-9_\-]".

Example: if my sentence is The king shouted “Let the game begin”. then [The, king, shouted] should be extracted as ‘words’ entities and [“Let the game begin”] should be extracted as a ‘multiWords’ entity.

This is my config file:

# Configuration for Rasa NLU.
language: en
pipeline:
  - name: SpacyNLP
    case_sensitive: true
  - name: SpacyTokenizer
  - name: RegexFeaturizer
  - name: SpacyFeaturizer
  - name: CRFEntityExtractor
  - name: "regex.RegexEntityExtractor"
  - name: EntitySynonymMapper
  - name: SklearnIntentClassifier

# Configuration for Rasa Core.
policies:
  - name: MemoizationPolicy
  - name: MappingPolicy

But I could not get the extraction to work properly with CRFEntityExtractor and RegexEntityExtractor. Can anyone suggest how to do this? Thanks in advance.

The spaCy tokenizer probably strips the quotes. For such simple rules I don’t see a reason to use ML to extract entities. Just create a custom component that does that.
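
As a rough illustration of the matching logic such a component could wrap, here is a minimal sketch in plain Python using only the standard `re` module. It is not Rasa's API: the function name `extract_entities` and the dictionary keys are hypothetical, and the wiring into a custom component depends on the Rasa version. Two things differ from the regexes in the question and are assumptions on my part: the ‘words’ pattern uses `+` instead of `*` so that empty matches are skipped, and the ‘multiWords’ pattern accepts curly quotes as well as straight ones, since the example sentence uses curly quotes.

```python
import re

# Quoted phrase; same character classes as in the question, but the
# delimiters also allow curly quotes (an assumption, see above).
MULTI_WORDS = re.compile(r'["“][a-zA-Z0-9_\-][a-zA-Z0-9_\- ]*[a-zA-Z0-9_\-]["”]')

# Single word; "+" instead of "*" so that empty strings never match.
WORDS = re.compile(r'[a-zA-Z0-9_\-]+')


def extract_entities(text):
    """Return 'words' and 'multiWords' entities found in the raw text.

    Works on the untokenized string, so the quotes are still present.
    """
    entities = []
    quoted_spans = []

    # First pass: quoted phrases.
    for match in MULTI_WORDS.finditer(text):
        quoted_spans.append((match.start(), match.end()))
        entities.append({
            "entity": "multiWords",
            "value": match.group(),
            "start": match.start(),
            "end": match.end(),
        })

    # Second pass: single words that are not inside a quoted phrase.
    for match in WORDS.finditer(text):
        inside_quotes = any(
            start <= match.start() and match.end() <= end
            for start, end in quoted_spans
        )
        if not inside_quotes:
            entities.append({
                "entity": "words",
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })

    # Report entities in the order they appear in the text.
    return sorted(entities, key=lambda entity: entity["start"])


if __name__ == "__main__":
    for entity in extract_entities('The king shouted “Let the game begin”.'):
        print(entity["entity"], entity["value"])
```

Running the patterns on the raw message text rather than on tokens sidesteps whatever the tokenizer does with the quotes. Note that, as in the original regex, a quoted phrase needs at least two characters between the quotes to match.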