Unable to extract an entity if its value contains parentheses

Hello All,

I am using RASA 2.8.14 and am seeing some strange behaviour: RASA is unable to extract entities correctly when the entity values in the training data contain parentheses.

My training data looks like this:

- intent: app_cdpr_filter_details
  examples: |
    - For [HSQ](cdpr_card_model) I am not able to see [HSQ](cdpr_card_model) in CP
    - For [IMM12(5)](cdpr_card_model) I am not able to see [7750 SR-12](cdpr_device_model) in CP
    - For [IMM12(Up)](cdpr_card_model) I am not able to see [Core Rt](cdpr_device_type) in CP
    - For [IMM2](cdpr_card_model) I am not able to see [EEA](cdpr_device_usage) in CP

In my case, IIM12(5) is extracted as IIM12, and IIM12(Up) is also extracted as IIM12.

Can anyone please help me make this work?

My config file looks like this:

pipeline:
  - name: SpacyNLP
    model: "en_core_web_md"
    case_sensitive: False
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 150
    constrain_similarities: True
    entity_recognition: False
  - name: RegexEntityExtractor
    use_lookup_tables: True
    use_regexes: True
  - name: EntitySynonymMapper
  - name: ResponseSelector
    retrieval_intent: app_cdpr_preference
    epochs: 100
    constrain_similarities: True
  - name: ResponseSelector
    retrieval_intent: app_cdpr_search_site
    epochs: 100
    constrain_similarities: True
  - name: ResponseSelector
    retrieval_intent: app_cdpr_show_hide
    epochs: 100
    constrain_similarities: True
  - name: ResponseSelector
    retrieval_intent: app_cdpr_export
    epochs: 100
    constrain_similarities: True
  - name: ResponseSelector
    retrieval_intent: app_cdpr_info
    epochs: 100
    constrain_similarities: True
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

policies:
  - name: RulePolicy
  - name: AugmentedMemoizationPolicy
  - name: TEDPolicy
    max_history: 2
    epochs: 200
    constrain_similarities: True
    model_confidence: linear_norm

Hello @naveensiwas, nice use case. Have you mentioned a regex for this entity in your training data?

Demo code:

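nlu:
# Update the regex based on your training examples.
# Keep this comment outside the "examples" block: inside it, the text after "#" would become part of the pattern.
- regex: cdpr_card_model
  examples: |
    - \d{10,12}

I have not implemented this myself, but it should help.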

Ref: Training Data Format

@nik202 thank you so much for the suggestions. For my use case, what should the regex be so that it treats the parentheses as part of the entity value?

@naveensiwas you can try creating the regex and testing it here: https://regex101.com

Do you want a regex for this: IMM12(Up)?

@nik202 yes please, if you can help me with this.

@naveensiwas OK, give me some time.

Yes, sure, please take your time.

@naveensiwas please cross-check this code using: https://regex101.com

(.*)

@nik202 I am confused by the regex: should it match a ) sign followed by a ] sign, or something else?

I have created a regex that identifies a ) sign followed by a ] sign. Please correct me if I am wrong.

@naveensiwas haha, no worries. Please see how to check the regex :slight_smile:

I only created and tested it based on your training data and the parentheses you mentioned. Try it; it is generic, so it should match.

Cross-checked:

Good luck, hope this will help you!!

@nik202 I will try it and get back to you, thank you for your time :slightly_smiling_face:

Have a nice day ahead :+1:

@naveensiwas no worries, I know you will be able to solve this.

Tip: delete all previously trained models, then retrain and run again.

Nik, this Regex will match anything :sweat_smile:

There is no good way to use regex here. I don’t have a solution, but it is definitely not a regex that matches anything you give it. Imagine: every word in your sentence would be extracted as the entity!

When writing a regex, you should consider not only what it matches (otherwise .* will do, since it matches any single line of text!), but, just as importantly, what it doesn’t match!
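To make the point about over-matching concrete, here is a minimal Python sketch that runs the `(.*)` pattern from this thread against one of the training sentences, next to a narrower pattern. The narrower `card_model` pattern is purely hypothetical (it is not from this thread and not a recommended Rasa configuration); it only illustrates how to check what a regex does and does not match.

```python
import re

sentence = "For IMM12(Up) I am not able to see Core Rt in CP"

# The pattern suggested earlier in the thread: it swallows the whole line.
catch_all = re.compile(r"(.*)")

# Hypothetical narrower pattern (an assumption for illustration only):
# two or more uppercase letters, optional digits, then an optional
# parenthesised suffix such as "(5)" or "(Up)".
card_model = re.compile(r"\b[A-Z]{2,}\d*(?:\([A-Za-z0-9]+\))?")

print(catch_all.match(sentence).group(0) == sentence)  # True
print(card_model.findall(sentence))  # ['IMM12(Up)', 'CP']
```

Even the narrower pattern still picks up `CP`, which is not a card model. This is the kind of false positive you only find by testing the pattern against text it should not match.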

If the cdpr_card_model entity has a finite number of values, even if one million, I would use a Lookup Table for it.
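As a rough illustration of why a finite list of values is easier to handle than a hand-written regex, the sketch below builds a matcher from a list of known values in plain Python. It is not Rasa's internal implementation; the list `card_models` and the helper `find_card_models` are hypothetical names, and the values are simply the ones that appear in the training examples above (plus a bare `IMM12`, which is an assumption).

```python
import re

# Hypothetical lookup values, based on the training examples in this thread.
card_models = ["HSQ", "IMM12", "IMM12(5)", "IMM12(Up)", "IMM2"]

# Escape each value so "(" and ")" are treated literally, and try longer
# values first so "IMM12(5)" is preferred over its prefix "IMM12".
alternatives = sorted(card_models, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(value) for value in alternatives))


def find_card_models(text):
    """Return every known card model found in the text, left to right."""
    return [match.group(0) for match in pattern.finditer(text)]


print(find_card_models("For IMM12(5) I am not able to see 7750 SR-12 in CP"))
# ['IMM12(5)']
```

This sketch deliberately uses no word boundaries, so a value such as `IMM2` would also match inside a longer token; a real setup would need to decide how to delimit values.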

@ChrisRahme :slight_smile:

@ChrisRahme I am already using the lookup table approach, but the IIM12(5) entity value is extracted as IIM12, and the IIM12(Up) entity value is also extracted as IIM12.

Any idea how I should handle this scenario?

Hi @ChrisRahme,

Looking at the attached regex screenshots, the regex matches all of the entity values in my training data, but I am still unable to get the exact entity value.

  • IIM12(5) is being extracted as IIM12
  • IIM12(Up) is being extracted as IIM12

I am using the lookup table and synonyms approach for my entity extraction; please find the output of rasa shell nlu

Can you please help me with this?

Thank you in advance :slightly_smiling_face:

@nik202 Can you please help me with this?
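One possible explanation for the truncated values reported above, offered as an unconfirmed assumption rather than a diagnosis: if the lookup values are wrapped in regex word boundaries (`\b`), a value ending in `)` cannot match when it is followed by a space, because `\b` requires a word character on exactly one side. The plain-Python sketch below shows that effect. The list `values` is hypothetical, and whether Rasa 2.8.14 builds its lookup regex this way should be verified against the RegexEntityExtractor documentation.

```python
import re

text = "For IMM12(5) I am not able to see 7750 SR-12 in CP"

# Hypothetical lookup values; "IMM12" is assumed to be present because
# that is the value reported as extracted.
values = ["IMM12(5)", "IMM12"]

# Each value wrapped in \b ... \b. After ")" comes a space, and both are
# non-word characters, so there is no boundary and "IMM12(5)" cannot match.
with_boundaries = re.compile(
    "|".join(r"\b" + re.escape(v) + r"\b" for v in values)
)

# Lookarounds only require that no word character touches the value,
# which also works for values that end in ")".
with_lookarounds = re.compile(
    "|".join(r"(?<!\w)" + re.escape(v) + r"(?!\w)" for v in values)
)

print(with_boundaries.search(text).group(0))   # IMM12
print(with_lookarounds.search(text).group(0))  # IMM12(5)
```

If this is what is happening, the first pattern reproduces the reported behaviour (`IMM12(5)` extracted as `IMM12`), while the second keeps the parentheses as part of the value.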