Regex entity extractor generated an incomplete report

dsmendes · November 25, 2021, 2:57pm

Hi, I think I have a problem with the generated RegexEntityExtractor_errors.json. The report is produced by running rasa test nlu. Looking at it, I noticed that entities are only extracted by CRFEntityExtractor. Below I pasted an example from the generated file. However, when I use rasa shell nlu, both extractors extract the entity correctly.

RegexEntityExtractor_errors.json

  {
    "text": "Compare my tariff with tariff_type_2",
    "entities": [
      {
        "start": 23,
        "end": 36,
        "value": "tariff_type_2",
        "entity": "tariff_type"
      }
    ],
    "predicted_entities": [
      {
        "entity": "tariff_type",
        "start": 23,
        "end": 36,
        "confidence_entity": 0.9832901695523858,
        "value": "tariff_type_2",
        "extractor": "CRFEntityExtractor"
      }
    ]
  },

Output of rasa shell nlu:

{
  "text": "Compare my tariff with tariff_type_2",
  "intent": {
    "name": "tariff_comparison",
    "confidence": 0.984128006208084
  },
  "entities": [
    {
      "entity": "tariff_type",
      "start": 23,
      "end": 36,
      "value": "tariff_type_2",
      "extractor": "RegexEntityExtractor"
    },
    {
      "entity": "tariff_type",
      "start": 23,
      "end": 36,
      "confidence_entity": 0.9947785145759475,
      "value": "tariff_type_2",
      "extractor": "CRFEntityExtractor"
    }
  ],

Can anyone explain to me why this is happening?
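One quick way to check which extractors appear in a report is to tally the `extractor` field of every predicted entity. The snippet below is a minimal sketch, assuming the report is a list of records shaped like the excerpt above; the helper name `count_predictions_by_extractor` is made up for illustration and is not part of Rasa.

```python
from collections import Counter
from typing import Any, Dict, List


def count_predictions_by_extractor(records: List[Dict[str, Any]]) -> Counter:
    """Count predicted entities per extractor across all report records."""
    counts: Counter = Counter()
    for record in records:
        for entity in record.get("predicted_entities", []):
            # Fall back to "unknown" if a prediction carries no extractor name.
            counts[entity.get("extractor", "unknown")] += 1
    return counts


# A single record copied from the RegexEntityExtractor_errors.json excerpt.
# With a real report you would load the file with json.load instead.
records = [
    {
        "text": "Compare my tariff with tariff_type_2",
        "entities": [
            {
                "start": 23,
                "end": 36,
                "value": "tariff_type_2",
                "entity": "tariff_type",
            }
        ],
        "predicted_entities": [
            {
                "entity": "tariff_type",
                "start": 23,
                "end": 36,
                "confidence_entity": 0.9832901695523858,
                "value": "tariff_type_2",
                "extractor": "CRFEntityExtractor",
            }
        ],
    }
]

counts = count_predictions_by_extractor(records)
print(counts)
```

If RegexEntityExtractor never shows up in the tally, that matches the symptom described here: only CRFEntityExtractor contributes predictions to the report.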

Another thing: during cross-validation (5 folds) I got the following warning:

UserWarning: No lookup tables or regexes defined in the training data that have a name equal to any entity in the training data. In order for this component to work you need to define valid lookup tables or regexes in the training data.

I have 50 examples, of which 19 don’t contain any entity from the lookup table. Thank you.

MatthiasLeimeister · December 1, 2021, 2:21pm

Hi @dsmendes, welcome to the community forum! Could you please provide more information on your setup? What Rasa version are you using (rasa --version)? Could you also please post all associated YAML files (config.yml, domain.yml, nlu.yml, etc.) that lead to the observed issue? Please enclose the contents in code blocks (using ```). Thanks!

dsmendes · December 2, 2021, 11:05am

Hi,

my rasa version:

└─ $ rasa --version
Rasa Version      :         2.8.2
Minimum Compatible Version: 2.8.0
Rasa SDK Version  :         2.8.2
Rasa X Version    :         None
Python Version    :         3.8.0
Operating System  :         Linux-5.4.0-90-generic-x86_64-with-glibc2.27
Python Path       :         xxxxxxxxx

I compressed the Rasa files here: data.zip (8.3 KB)

Thanks

MatthiasLeimeister · December 2, 2021, 9:59pm

Hi @dsmendes, thanks for sending the config and data files. I was able to reproduce the issue and this looks like a bug in the creation of the cross-validation folds, where the lookup tables are not kept when the train and test data objects are created. I filed a bug report here:

github.com/RasaHQ/rasa

Cross validation not working for RegexEntityExtractor with lookup tables

opened 09:56PM - 02 Dec 21 UTC

mleimeister

type:bug

area:rasa-oss

### Rasa Open Source version

2.8.15

### Rasa SDK version

_No response_

### Rasa X version

_No response_

### Python version

3.8

### What operating system are you using?

OSX

### What happened?

Based on this [forum report](http://forum.rasa.com/t/regex-entity-extractor-generated-a-incomplete-report/49529), which I was able to reproduce, it looks like using `rasa test nlu` with cross validation does not properly work for `RegexEntityExtractor` with lookup tables. When running cross validation with the provided NLU data, the following warning shows up:

```
UserWarning: No lookup tables or regexes defined in the training data that have a name equal to any entity in the training data. In order for this component to work you need to define valid lookup tables or regexes in the training data.
```

Afterwards, the report shows no entities extracted by `RegexEntityExtractor` with confusion matrix

<img src="https://user-images.githubusercontent.com/10855680/144508400-f037d3cc-8e71-43e5-a508-c5aa0eecc73b.png" width=300 height=300>

### Source of the problem

Stepping through the code showed that when folds are generated from the training data in [generate_folds](https://github.com/RasaHQ/rasa/blob/2.8.x/rasa/nlu/test.py#L1506), the `TrainingData` objects created [here](https://github.com/RasaHQ/rasa/blob/2.8.x/rasa/nlu/test.py#L1524) for the folds don't have the `lookup_tables` parameter set, resulting in empty lookup tables for both train and test data.

### Proposed solution

Add the `lookup_tables` parameter to take over the lookup tables from the original training data object.

### Command / Request

```shell
rasa test nlu --nlu data/nlu.yml --cross-validation --runs 1 --folds 2
```

### Relevant log output

```shell
(rasa) matthias@Matthiass-MBP forum-49529 % rasa test nlu --nlu data/nlu.yml --cross-validation --runs 1 --folds 2
2021-12-02 22:54:43 INFO rasa.cli.test - Test model using cross validation.
2021-12-02 22:54:46 INFO rasa.nlu.utils.spacy_utils - Trying to load spacy model with name 'en_core_web_md'
2021-12-02 22:54:47 INFO rasa.nlu.components - Added 'SpacyNLP' to component cache. Key 'SpacyNLP-en_core_web_md'.
2021-12-02 22:54:47 INFO rasa.nlu.model - Starting to train component SpacyNLP
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
2021-12-02 22:54:48 INFO rasa.nlu.model - Starting to train component SpacyTokenizer
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
2021-12-02 22:54:48 INFO rasa.nlu.model - Starting to train component RegexFeaturizer
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
2021-12-02 22:54:48 INFO rasa.nlu.model - Starting to train component SpacyFeaturizer
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
2021-12-02 22:54:48 INFO rasa.nlu.model - Starting to train component RegexEntityExtractor
/Users/matthias/Workspace/rasa/rasa/shared/utils/io.py:97: UserWarning: No lookup tables or regexes defined in the training data that have a name equal to any entity in the training data. In order for this component to work you need to define valid lookup tables or regexes in the training data.
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
2021-12-02 22:54:48 INFO rasa.nlu.model - Starting to train component CRFEntityExtractor
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
2021-12-02 22:54:48 INFO rasa.nlu.model - Starting to train component EntitySynonymMapper
2021-12-02 22:54:48 INFO rasa.nlu.model - Finished training component.
```
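To make the reported cause easier to picture, here is a toy sketch of the fold problem. It does not use Rasa's code: `ToyTrainingData` is a hypothetical stand-in for `TrainingData`, the round-robin split and the function names are assumptions made for illustration, and the example sentences and the `tariff_type_1` entry are made up. It only shows how building fold objects without passing `lookup_tables` leaves every fold with empty lookup tables, and how forwarding them keeps them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class ToyTrainingData:
    """Hypothetical stand-in for Rasa's TrainingData (not the real class)."""

    training_examples: List[str]
    lookup_tables: List[Dict[str, object]] = field(default_factory=list)


def _split(examples: List[str], n: int, i: int) -> Tuple[List[str], List[str]]:
    """Simple round-robin split; used here only to keep the sketch short."""
    test = [ex for j, ex in enumerate(examples) if j % n == i]
    train = [ex for j, ex in enumerate(examples) if j % n != i]
    return train, test


def generate_folds_without_lookups(
    n: int, data: ToyTrainingData
) -> Iterator[Tuple[ToyTrainingData, ToyTrainingData]]:
    """Mimics the reported behavior: lookup tables are not forwarded."""
    for i in range(n):
        train, test = _split(data.training_examples, n, i)
        yield ToyTrainingData(train), ToyTrainingData(test)


def generate_folds_with_lookups(
    n: int, data: ToyTrainingData
) -> Iterator[Tuple[ToyTrainingData, ToyTrainingData]]:
    """Mimics the proposed solution: lookup tables are taken over."""
    for i in range(n):
        train, test = _split(data.training_examples, n, i)
        yield (
            ToyTrainingData(train, lookup_tables=data.lookup_tables),
            ToyTrainingData(test, lookup_tables=data.lookup_tables),
        )


# Made-up data for illustration only.
data = ToyTrainingData(
    training_examples=[
        "Compare my tariff with tariff_type_2",
        "Compare my tariff with tariff_type_1",
        "Show my tariff",
        "What is my tariff",
    ],
    lookup_tables=[
        {"name": "tariff_type", "elements": ["tariff_type_1", "tariff_type_2"]}
    ],
)

buggy_folds = list(generate_folds_without_lookups(2, data))
fixed_folds = list(generate_folds_with_lookups(2, data))

for train, test in buggy_folds:
    print("without:", train.lookup_tables, test.lookup_tables)
for train, test in fixed_folds:
    print("with:", len(train.lookup_tables), len(test.lookup_tables))
```

In the first variant every fold ends up with an empty `lookup_tables` list, which would explain why a component that relies on lookup tables has nothing to match against during cross-validation.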

MatthiasLeimeister · December 16, 2021, 5:02pm

Hi @dsmendes, the issue was fixed in the latest release, 3.0.3. If you update to that version, the CV report should be correct.

dsmendes · December 16, 2021, 5:14pm

Thank you for your help
