NLU not predicting entities separated by the '/' character in the new version of Rasa. Why?

bayesianwannabe · June 8, 2020, 12:53pm

Hey all,

I was using Rasa 1.9.0 and needed to load a custom spaCy language model that I had edited so that it tokenizes the way I need.

Basically, I need the ‘/’ character to sometimes be treated as a token of its own. For example, with the default spaCy behavior for my language (pt-br), the sentence ‘I bought it 05/20’ is tokenized as: ‘I’, ‘bought’, ‘it’, ‘05/20’.

I adapted the language package and saved it to have the sentence tokenized as: ‘I’, ‘bought’, ‘it’, ‘05’, ‘/’, ‘20’.
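One way such a tokenizer change could look, as a minimal sketch and not the actual edited package: it assumes spaCy is installed, uses a blank `pt` pipeline in place of `my_personal_lang_model`, and adds a hypothetical infix pattern that splits ‘/’ only when it sits between two digits.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank Portuguese pipeline stands in for the custom model.
nlp = spacy.blank("pt")

# Hypothetical extra infix rule: split "/" when it is between two digits.
slash_between_digits = r"(?<=[0-9])/(?=[0-9])"
infixes = list(nlp.Defaults.infixes) + [slash_between_digits]

# Replace the tokenizer's infix matcher with one that includes the new rule.
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

tokens = [token.text for token in nlp("I bought it 05/20")]
print(tokens)
```

The modified pipeline would still have to be saved and packaged so that `SpacyNLP` can load it by name; that step is not shown here.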

This second behavior is crucial for my annotation scheme. I got it working in Rasa 1.9.0 with the following config.yml:

language: "pt"
pipeline:
  - name: "SpacyNLP"
    model: "my_personal_lang_model"
  - name: "SpacyTokenizer"
  - name: LexicalSyntacticFeaturizer
    features: [
        ["low", "title", "upper", "pos", "pos2", "digit"],
        [
          "low",
          "prefix5",
          "prefix2",
          "suffix5",
          "suffix3",
          "suffix2",
          "upper",
          "title",
          "digit",
          "pos",
          "pos2"
        ],
        ["low", "title", "upper", "pos", "pos2", "digit"],
    ]
  - name: "CountVectorsFeaturizer"
    min_ngram: 1
    max_ngram: 2
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "SpacyFeaturizer"
  - name: "DIETClassifier"
    epochs: 300
    random_seed: 42
    embedding_dimension: 30
  - name: "EntitySynonymMapper"

I needed this custom spaCy model because during training I would get a warning about misaligned entity values. It seems the ‘/’ characters were not being separated, so I could not assign two entities to different parts of a single token. As a result, the model was unable to infer the entities MM (month) and DD (day) correctly when testing with ‘rasa shell nlu’. The custom spaCy model together with this config solved the problem, and the NLU inferred MM/DD date patterns well.

After updating Rasa, I retrained my model with the same config as above, and I have already checked that my custom spaCy model still tokenizes as I expect. No warning appears during training in the new version, but at inference time in ‘rasa shell nlu’ my model fails to recognize these entities.

What changed? The tokenizer component still seems to separate the ‘/’ token, but it looks like another component treats this character differently in the current Rasa version.

I looked at the changelog, but I still have difficulty understanding why I am having this issue.

I saw that there is a new note in the training data format section, but I am not sure where or how it affects my process:
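The misalignment described above can be illustrated with a small standalone check. This is a sketch under stated assumptions, not Rasa's own validation code: `tokenize` is a hypothetical regex stand-in for the spaCy tokenizer, and `is_aligned` is a hypothetical helper that tests whether an annotated span starts and ends on token boundaries.

```python
import re

def tokenize(text, split_slash):
    """Hypothetical stand-in tokenizer returning (start, end, text) tuples."""
    # With split_slash, digits and "/" become separate tokens;
    # otherwise tokens are simply runs of non-whitespace characters.
    pattern = r"\d+|/|[^\s\d/]+" if split_slash else r"\S+"
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

def is_aligned(tokens, start, end):
    """True if the span starts at a token start and ends at a token end."""
    starts = {s for s, _, _ in tokens}
    ends = {e for _, e, _ in tokens}
    return start in starts and end in ends

text = "I bought it 05/20"
mm_span = (12, 14)  # "05", annotated as MM
dd_span = (15, 17)  # "20", annotated as DD

default_tokens = tokenize(text, split_slash=False)
custom_tokens = tokenize(text, split_slash=True)

for name, toks in [("default", default_tokens), ("custom", custom_tokens)]:
    print(name, [t for _, _, t in toks],
          "MM aligned:", is_aligned(toks, *mm_span),
          "DD aligned:", is_aligned(toks, *dd_span))
```

With ‘05/20’ kept as one token, neither span lines up with a token boundary; once ‘/’ is its own token, both do.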

Note

/ symbol is reserved as a delimiter to separate retrieval intents from response text identifiers. Make sure not to use it in the name of your intents.

Any help on this would be great! Is there anything I can do before considering downgrading Rasa to a previous version?

bayesianwannabe · June 8, 2020, 2:33pm

It might be related to this:

github.com/RasaHQ/rasa

Entity Recognition on sub-words

tabergma · opened 08:53AM - 27 Mar 20 UTC · closed 03:04PM - 30 Mar 20 UTC

Labels: type:enhancement, area:rasa-oss

**Description of Problem**: Related to https://github.com/RasaHQ/rasa/issues/54…75 We found other edge cases that can happen if we are using a tokenizer that splits up words into sub-words. Let's take a look at an example:

Sentence: `Buenos Aires is a city`
Tokens: `Buen`, `os`, `Ai`, `res`, `is`, `a`, `city`

Scenario 1: One entity covers multiple words or a single word. `city` entity -> `Buen` `os` `Ai` `res`, `type` entity -> `city`
Scenario 2: An entity covers just a part of a word. `city` entity -> `Buen`
Scenario 3: An entity covers two words, but at least one of the words just partly. `city` entity -> `os` `Ai`
Scenario 4: The sub-words of one word are annotated with different entities. `city` entity -> `Ai`, `state` entity -> `res`

Scenario 1 and 4 are handled. We need to take care of Scenario 2 and 3.

**Overview of the Solution**: We should keep labels if possible. Extend the entities to cover complete words instead of just parts of the words.

But I wonder: if my tokenizer is already splitting a date such as ‘MM/DD’ into ‘MM’, ‘/’ and ‘DD’, are the other components somehow bypassing this split, or am I missing something in the config.yml or the annotation?

My annotations for MM and DD, for example, are ‘05’ and ‘20’, with no empty space in the interval.

Tanja · June 11, 2020, 11:52am

@bayesianwannabe It is the same problem as described here: DIET classifier _predict_entities function clean_up_entities for Chinese language issue · Issue #5972 · RasaHQ/rasa · GitHub. We added a cleanup method because we saw issues with some tokenizers. However, that introduced new issues that we did not have on our radar back then. The issue is already fixed in Tokenizers don't split words into sub-words by tabergma · Pull Request #5756 · RasaHQ/rasa · GitHub and will be released in the next minor version. We don’t have a date for that release yet, so please be patient. In the meantime, I would recommend continuing to use Rasa 1.9.7. Sorry about that.
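To make the failure mode concrete, here is a rough sketch of how a per-word cleanup step could collapse MM and DD. It is not Rasa's actual `clean_up_entities` code. It assumes that tokens with no whitespace between them are treated as sub-words of one word, and it uses a hypothetical rule (the first non-`O` label wins) to pick one label per word.

```python
def group_adjacent(tokens):
    """Group (start, end, text) tokens that touch each other into 'words'."""
    words = []
    for start, end, text in tokens:
        # Assumption: no gap between two tokens means they form one word.
        if words and words[-1][-1][1] == start:
            words[-1].append((start, end, text))
        else:
            words.append([(start, end, text)])
    return words

def one_label_per_word(tokens, labels):
    """Hypothetical cleanup: every token in a word gets the word's first non-O label."""
    cleaned = []
    i = 0
    for word in group_adjacent(tokens):
        word_labels = labels[i:i + len(word)]
        first = next((label for label in word_labels if label != "O"), "O")
        cleaned.extend([first] * len(word))
        i += len(word)
    return cleaned

# Tokens for "I bought it 05/20" as produced by the custom tokenizer.
tokens = [(0, 1, "I"), (2, 8, "bought"), (9, 11, "it"),
          (12, 14, "05"), (14, 15, "/"), (15, 17, "20")]
labels = ["O", "O", "O", "MM", "O", "DD"]

cleaned = one_label_per_word(tokens, labels)
print(cleaned)
```

Under these assumptions ‘05’, ‘/’ and ‘20’ count as a single word, so the separate DD prediction is overwritten even though the tokenizer split the date correctly.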

bayesianwannabe · June 11, 2020, 2:13pm

Thank you for the answer, Tanja! I did the downgrade and now I am using the same entities with no problems.

Cheers
