Regex not Working for Training Data

Hi, I have an entity called NPI, which is a 10-digit numeric value. I have more than 150 training examples for NPI, and also a regex.

## intent:inform
- [9009548846](npi)
- [9034548846](npi)
- [9899548846](npi)
- [7909548846](npi)
- [2609548846](npi)
- [9809548846](npi)

## regex:npi
- [0-9]{10}

But when I enter numeric entries like 12345, 123456, or 12345678, which have fewer than 10 digits, they are still recognized as the NPI entity.
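
As a sanity check, the pattern on its own does not even match these shorter inputs; a quick check with Python’s re module, outside of Rasa:

import re

pattern = re.compile(r"[0-9]{10}")

# None of the short inputs contain ten consecutive digits, so the raw
# regex cannot be what is matching them.
for text in ["12345", "123456", "12345678", "9009548846"]:
    print(text, "->", bool(pattern.search(text)))
# 12345 -> False
# 123456 -> False
# 12345678 -> False
# 9009548846 -> True

So the matching seems to come from the trained model generalizing, not from the regex itself.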
How to resolve this?

This is my pipeline configuration:

language: en
pipeline:
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 200
  - name: EntitySynonymMapper
  - name: ResponseSelector
    retrieval_intent: smalltalk
    epochs: 200
    scale_loss: false

hi @Akhil - can you please try removing the LexicalSyntacticFeaturizer from your pipeline? This adds a feature to your feature vector which says ‘is this token a number’, which is probably the cause. You might also try adding some NLU examples where you have numbers which are not ten digits long and which are not highlighted as entities. Then your model has a chance to learn that the length is important.
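
For example (hypothetical examples, reusing the numbers from your post and deliberately left unannotated):

## intent:inform
- 12345
- 123456
- 12345678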


Hi @amn41.

I tried your suggestion, but it didn’t help. I have another entity called Claim ID, which is an alphanumeric string of length 1-50. I have around 150 alphanumeric strings generated with a random string generator, and also a regex, like this:

## intent:inform
- [E9wwA6g2usABDHJi5r6H6zONyqWFQjg7f4](claim_id)
- [vgvsq9iJF2NZZ3G5FkcO4o](claim_id)
- [fUyUG0IXUWMW0sM3TUqgM2N4505I](claim_id)
- [P0DHfr7v029y25t9V2M](claim_id)
- [Bb0UO63Bs2e8Rj44MJGmj9ttLgGDd756Ph](claim_id)
- [fCdRDN86qZeMN0l0r12R8c4](claim_id)
- [BDjFOHoYB3VAT3CB](claim_id)
- [KJYh8kN0B2NED5Yvb](claim_id)
- [9009548846](npi)
- [9034548846](npi)
- [9899548846](npi)
- [7909548846](npi)
- [2609548846](npi)
- [9809548846](npi)

## regex:claim_id
- [a-zA-Z0-9]{1,50}

## regex:npi
- [0-9]{10}

After removing the LexicalSyntacticFeaturizer, retraining, and running rasa shell nlu, this is the result.

The input for claim_id is getting split and recognized as 3 separate entities. How do I solve this issue? Should I stop using ConveRT and switch to the WhitespaceTokenizer or SpacyNLP?

Hi @amn41.

Any suggestions?

I used spaCy instead of ConveRT, and the claim IDs and NPIs are extracted perfectly without getting split, but there is intent misclassification with spaCy (probably because of the small amount of training data).

ConveRT performs very well at intent classification but splits entities during entity extraction. Please suggest a way to resolve this using ConveRT without the entities getting split.

hi @Akhil - you may have seen we are working towards the 2.0 release, and there are already a few alpha releases out there. 2.0 includes a RegexEntityExtractor component, which should do exactly what you want.
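
Roughly, the pipeline entry looks like this in 2.0 (a sketch based on the alpha releases; option names may still change before the final release):

- name: RegexEntityExtractor
  # match text case-insensitively
  case_sensitive: False
  # build patterns from the regexes defined in the training data
  use_regexes: True
  # also turn lookup tables into patterns, if you have any
  use_lookup_tables: True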

Hi @amn41. Is there a way I could do it in 1.10.12? Any way I could use it as a custom component in my current pipeline?

yes, for sure! It should be an easy component to copy over; then you can use it as a custom component, as you suggest.
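
A 1.10 pipeline references a custom component by its module path. Assuming you save the copied class in a file called custom_components.py next to your config (a hypothetical name), the entry would look something like:

pipeline:
  # ... your existing components ...
  - name: "custom_components.RegexEntityExtractor"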

Hi @amn41. I added it as a custom component and it works as expected. But I see that multiple entities of the same name are extracted. Is there a way I can specify the name of the entity extractor in slot mappings?

I can turn off entity extraction for DIET, but I need it for person_name entity extraction.

 "entities": [
    {
      "entity": "claim_id",
      "start": 0,
      "end": 10,
      "value": "3232046319",
      "extractor": "DIETClassifier"
    },
    {
      "entity": "claim_id",
      "start": 0,
      "end": 10,
      "value": "3232046319",
      "extractor": "RegexEntityExtractor"
    },
    {
      "entity": "npi",
      "start": 0,
      "end": 10,
      "value": "3232046319",
      "extractor": "RegexEntityExtractor"
    }
  ],
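
Both of my patterns match this input, which presumably explains the duplicates; a quick check, again with plain Python:

import re

text = "3232046319"
# ten digits satisfy both the claim_id and the npi pattern
print(bool(re.fullmatch(r"[a-zA-Z0-9]{1,50}", text)))  # True -> claim_id
print(bool(re.fullmatch(r"[0-9]{10}", text)))          # True -> npi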

hi @Akhil - you should be able to remove the entity annotations for the claim_id entity from your training data, since the regex extractor doesn’t use them anyway.
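
For example, the claim_id lines would just become plain, unannotated text:

## intent:inform
- E9wwA6g2usABDHJi5r6H6zONyqWFQjg7f4
- vgvsq9iJF2NZZ3G5FkcO4o
- fUyUG0IXUWMW0sM3TUqgM2N4505I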

Hi @amn41. Oh, ok. In that case, I can also remove the npi annotations, right? Since npi also has a regular expression, similar to claim_id.

Is this a correct understanding?

yes

Hi @amn41. I removed all examples of claim_id and npi as you suggested. I am getting the following warning message, and no entities are extracted in rasa shell nlu.

2020-09-09 15:16:01 INFO     rasa.nlu.model  - Starting to train component RegexEntityExtractor
/home/akhil/office/chatbot/dev/venv/lib/python3.6/site-packages/rasa/utils/common.py:363: UserWarning: No lookup tables or regexes defined in the training data that have a name equal to any entity in the training data. In order for this component to work you need to define valid lookup tables or regexes in the training data.

This is my NLU file:

## intent:inform
- My first name is [James](name)
- my first name is [Sage](name)
- my first name is [Louis](name)
- my first name is [Kris](name)
- my first name is [Graciela](name)
- my first name is [Cammy](name)
- my first name is [Mattie](name)
- my first name is [Darakjy](name)
- my first name is [Venere](name)
- my first name is [Yuki](name)
- my firstname is [Leota](name)
- Ok, it is [Minna](name)
- Its [Donette](name)
- It is [Abel](name)
- my first name is oov
- my firstname is oov
- Ok, it is oov
- Its oov
- It is oov
- oov
- [Josephine](name)
- [Lenna](name)
- [Mitsue](name)
- [Sage](name)
- [Kris](name)
- [Kiley](name)
- [Graciela](name)
- [Cammy](name)
- [Mattie](name)
- [Meaghan](name)
- [Gladys](name)
- [Yuki](name)
- [Fletcher](name)
- [Bette](name)
- [Veronika](name)
- [Butt](name)
- [Darakjy](name)
- [Venere](name)
- [Paprocki](name)
- [Foller](name)
- [Morasca](name)
- [Tollner](name)
- [Dilliard](name)
- [Wieser](name)
- [Marrier](name)
- [Amigon](name)
- [Maclead](name)
- [Caldarera](name)
- [Ruta](name)
- [Albares](name)
- [Poquette](name)
- [Garufi](name)
- [Rim](name)
- [Whobrey](name)
- [Michael](name)
- [Benjamin](name)
- [Alexander](name)
- [Daniel](name)
- [John](name)
- [Adam](name)
- [Smith](name)  
- [Johnson](name)  
- [Williams](name)  
- [Jones](name)  
- [Brown](name)
- [Davis](name)  
- [Miller](name)
- [Wilson](name)  
- [Moore](name)  
- [Taylor](name)  
- [Anderson](name)  
- [Thomas](name)  
- [Jackson](name)  
- [White](name)
- [shridhar](name)
- [ashutosh](name)
- [abhash](name)
- [saurabh](name)
- [tushar](name)
- [srieram](name)
- [surajit](name)
- [mayank](name)
- [siddhant](name)
- [ayush](name)
- [yash](name)
- [aman](name)
- [sagar](name)
- [govind](name)
- [mudit](name)
- [subodh](name)
- [ankit](name)
- [arjun](name)
- [aditya](name)
- [shabham](name)
- [surjeet](name)
- [deepak](name)
- [mohit](name)
- [sri](name)
- [kunal](name)
- [mohammad](name)
- [sanjish](name)
- [mohd](name)
- [nakul](name)
- [benazeer](name)
- [ravi](name)
- [rahul](name)
- [ankuran](name)
- [faiz](name)
- [hemant](name)
- [naman](name)
- [kishan](name)
- [anuj](name)
- [jayan](name)
- [rohit](name)
- [diljeet](name)
- [chandan](name)
- [shubham](name)
- [siddharth](name)
- [himanshu](name)
- [rajesh](name)
- [upendra](name)
- [mukta](name)
- [shreyansh](name)
- [avinas](name)
- [anchal](name)
- [gaurav](name)
- [rakesh](name)
- [bijesh](name)
- [abinav](name)
- [vinod](name)
- [devansh](name)
- [jitender](name)
- [sandeep](name)
- [shikhar](name)
- [prakesh](name)
- [diwakar](name)
- [pratyesh](name)
- [nishant](name)
- [krishan](name)
- [tusher](name)
- [ankur](name)
- [satyendra](name)
- [raj](name)
- [arun](name)
- [kumar](name)
- [sai](name)
- [vivek](name)
- [aashwin](name)
- [rajat](name)
- [utkarsh](name)
- [puneet](name)
- [gagandeep](name)
- [abhishek](name)
- [vinay](name)
- [harsha](name)
- [shashank](name)
- [piyush](name)
- [sarat](name)
- [rushabh](name)
- [prakashkumar](name)
- [anurag](name)
- [jatin](name)
- [venkatesh](name)
- [bhargav](name)
- [nithin](name)
- [dhoni](name)
- [akhil](name)
- [sanjeev](name)
- [dileep](name)
- [swaroop](name)
- [uday](name)
- [vishwak](name)
- [keshav](name)
- [sanket](name)
- [shivani](name)
- [sakshi](name)
- [priyankshi](name)
- [divya](name)
- [kamya](name)
- [priyanka](name)
- [kritika](name)
- [niharika](name)
- [bhavana](name)
- [lalita](name)
- [harshita](name)
- [tejaswi](name)
- [nikhitha](name)
- [gunjan](name)
- [akanksha](name)
- [pooja](name)
- [vandana](name)
- [geeta](name)
- [suvarna](name)
- [nidhi](name)
- [sireesha](name)
- [monica](name)
- [ankita](name)
- [tanvi](name)
- [priya](name)
- [shikha](name)
- [jyothi](name)
- [sugandha](name)
- [saba](name)
- [naimisha](name)
- [manasa](name)
- [shafa](name)
- [jyotsna](name)
- [prachi](name)
- [smriti](name)
- [sahiba](name)
- [khushboo](name)
- [shephali](name)
- [neha](name)
- [ritu](name)
- [gauri](name)
- [swathi](name)
- [swapna](name)
- [anusha](name)
- [poonam](name)
- [minal](name)
- [sayali](name)
- [hiral](name)
- [devika](name)
- [gayathri](name)
- [meenakshi](name)
- [karishma](name)
- [apoorva](name)
- [shailesha](name)
- [rashmi](name)
- [mahima](name)
- [anubha](name)
- [garima](name)
- [pallavi](name)
- [raghavi](name)
- [sayona](name)
- [jagruthi](name)
- [pranali](name)
- [shilpa](name)
- [sheena](name)
- [charmi](name)
- [arushi](name)
- [archana](name)
- [saroj](name)
- [heena](name)
- [preethi](name)
- [komali](name)
- [yoshita](name)
- [nirmala](name)
- [bhumika](name)
- [mayuri](name)
- [rithika](name)
- [vani](name)
- [rani](name)
- [reema](name)
- [surbhi](name)
- [parul](name)
- [ramya](name)
- [rekha](name)
- [jayasree](name)
- [omana](name)
- [vidhushi](name)
- [prerna](name)
- [swetha](name)
- [latha](name)
- [manjula](name)
- [riddhi](name)
- [sandhya](name)
- [anupama](name)
- [sunitha](name)
- [sarala](name)
- [namrata](name)
- [samiksha](name)
- [dilisha](name)
- [anamika](name)

## regex:claim_id
- [a-zA-Z0-9]{1,50}

## regex:npi
- [0-9]{10}

@Akhil - ah, my mistake. In fact you still have to provide at least one example of each of these entities. This is because we don’t have access to the domain inside the NLU component, and so we have to check the training data to see what entities exist.

You can remove this restriction in your version of the regex extractor by overriding the extract_patterns method, see Add RegexEntityExtractor by tabergma · Pull Request #6214 · RasaHQ/rasa · GitHub

Hi @amn41. Oh, ok. Thank you for the pointer. I added the following two lines and the issue seems to be resolved. Now I don’t need to add even one example to the training data; defining the regex is enough.

Please correct me if I have done anything wrong here. Thank you very much for your time and help.

# Imports for Rasa 1.10 (module paths may differ in other versions); the helpers
# _collect_regex_features and _convert_lookup_tables_to_regex are defined
# alongside this function in the copied module.
from typing import Dict, List, Text

from rasa.nlu.training_data import TrainingData


def extract_patterns(
    training_data: TrainingData,
    use_lookup_tables: bool = True,
    use_regexes: bool = True,
    use_only_entities: bool = False,
) -> List[Dict[Text, Text]]:
    """Extract a list of patterns from the training data.

    The patterns are constructed using the regex features and lookup tables defined
    in the training data.

    Args:
        training_data: The training data.
        use_only_entities: If True only lookup tables and regex features with a name
          equal to a entity are considered.
        use_regexes: Boolean indicating whether to use regex features or not.
        use_lookup_tables: Boolean indicating whether to use lookup tables or not.

    Returns:
        The list of regex patterns.
    """

    # Add the names of entities which have a defined regex but zero annotated
    # examples in the training data, so that their patterns are not skipped.
    for regex in training_data.regex_features:
        training_data.entities.add(regex["name"])

    if not training_data.lookup_tables and not training_data.regex_features:
        return []

    patterns = []

    if use_regexes:
        patterns.extend(_collect_regex_features(training_data, use_only_entities))
    
    if use_lookup_tables:
        patterns.extend(
            _convert_lookup_tables_to_regex(training_data, use_only_entities)
        )

    return patterns
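
For anyone landing on this thread later: the extraction step that consumes these patterns is essentially a loop of re.finditer over the message text. A minimal sketch of the idea (simplified, not the actual Rasa implementation; no flags or word-boundary handling):

import re
from typing import Any, Dict, List, Text


def extract_entities(
    text: Text, patterns: List[Dict[Text, Text]]
) -> List[Dict[Text, Any]]:
    """Match every pattern against the text and build entity dicts
    shaped like the ones in the output above."""
    entities = []
    for pattern in patterns:
        for match in re.finditer(pattern["pattern"], text):
            entities.append(
                {
                    "entity": pattern["name"],
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(0),
                    "extractor": "RegexEntityExtractor",
                }
            )
    return entities


# e.g. extract_entities("3232046319", [{"name": "npi", "pattern": "[0-9]{10}"}])
# -> [{'entity': 'npi', 'start': 0, 'end': 10, 'value': '3232046319',
#      'extractor': 'RegexEntityExtractor'}]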