NLU not predicting entities separated by the '/' character in the new version of Rasa. Why?

Hey all,

I was using Rasa 1.9.0 and needed to load a custom language model that I edited in spaCy to make the model behave the way I need.

Basically, I sometimes need the ‘/’ character to be interpreted as a token of its own. For example, with the default spaCy behavior for my language (pt-br), the sentence ‘I bought it 05/20’ is tokenized as: ‘I’, ‘bought’, ‘it’, ‘05/20’.

I adapted the language package and saved it so that the sentence is tokenized as: ‘I’, ‘bought’, ‘it’, ‘05’, ‘/’, ‘20’.
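To illustrate the splitting I need, here is a minimal plain-Python sketch. The `tokenize` helper is hypothetical and only mimics the behavior; the real change lives inside the spaCy language package (typically in its infix rules), not in code like this:

```python
import re

def tokenize(text):
    # Split on whitespace first, then break each word on '/',
    # keeping the '/' itself as a separate token.
    tokens = []
    for word in text.split():
        tokens.extend(t for t in re.split(r"(/)", word) if t)
    return tokens

print(tokenize("I bought it 05/20"))
# ['I', 'bought', 'it', '05', '/', '20']
```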

This second behavior is crucial for my annotation scheme. I solved it on Rasa 1.9.0 using the following config.yml:

language: "pt"
pipeline:
  - name: "SpacyNLP"
    model: "my_personal_lang_model"
  - name: "SpacyTokenizer"
  - name: LexicalSyntacticFeaturizer
    features: [
        ["low", "title", "upper", "pos", "pos2", "digit"],
        [
          "low",
          "prefix5",
          "prefix2",
          "suffix5",
          "suffix3",
          "suffix2",
          "upper",
          "title",
          "digit",
          "pos",
          "pos2"
        ],
        ["low", "title", "upper", "pos", "pos2", "digit"]
    ]
  - name: "CountVectorsFeaturizer"
    min_ngram: 1
    max_ngram: 2
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "SpacyFeaturizer"
  - name: "DIETClassifier"
    epochs: 300
    random_seed: 42
    embedding_dimension: 30
  - name: "EntitySynonymMapper"

I needed this custom spaCy model because training produced a warning about misaligned entity values: since the ‘/’ characters were not split off, I could not assign two different entities to parts of a single token, and the model was unable to infer the entities MM (month) and DD (day) correctly when testing with ‘rasa shell nlu’. The custom spaCy model, together with this config, solved the problem: the NLU inferred MM/DD date patterns well.

After updating Rasa, I retrained my model with the same config above, and I have already checked that my custom spaCy model still tokenizes things as I expect. No warning is shown when training in the new version, but at inference time in ‘rasa shell nlu’ my model fails to recognize these entities.

What changed? The tokenizer component still seems to split off the ‘/’ token, but it looks like another component treats this character differently in the current Rasa version.

I looked at the changelog, but I still have difficulty understanding why I am hitting this issue.

I saw that there is a new note in the training data format section, but I am not sure where or how it impacts my process:

Note

/ symbol is reserved as a delimiter to separate retrieval intents from response text identifiers. Make sure not to use it in the name of your intents.
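For context, that note is about retrieval intents, whose names use ‘/’ as a delimiter. A hypothetical example in the Rasa 1.x markdown training data format (the intent name and sentences are my own, not from this thread):

```md
## intent:faq/ask_hours
- what time do you open?
- are you open on weekends?
```

That naming convention is separate from how ‘/’ appears inside the text of training examples.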

Any help on this would be great! Is there anything I can do before considering downgrading Rasa to a previous version?

It might be related to this:

But I wonder: if my tokenizer is already splitting a date like ‘MM/DD’ into ‘MM’, ‘/’ and ‘DD’, are the other components somehow bypassing this splitting, or am I missing something in the config.yml or the annotation?

My annotations for MM and DD, for example, are ‘05’ and ‘20’, with no whitespace between them.
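For reference, the annotation style described above would look like this in the Rasa 1.x markdown format (the intent name and sentence are hypothetical examples of my scheme, not taken verbatim from my data):

```md
## intent:inform_date
- I bought it [05](MM)/[20](DD)
```

Each number carries its own entity label, and there is no whitespace around the ‘/’ between the two annotations.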

@bayesianwannabe It is the same problem as described here: DIET classifier _predict_entities function clean_up_entities for Chinese language issue · Issue #5972 · RasaHQ/rasa · GitHub. We added a cleanup method because we saw some issues with some tokenizers. However, that introduced new issues that we did not have on our radar back then.

The issue is already fixed in Tokenizers don't split words into sub-words by tabergma · Pull Request #5756 · RasaHQ/rasa · GitHub and will be released in the next minor version. We don’t have a date for that release yet, so please be patient. In the meantime, I would recommend continuing to use Rasa 1.9.7. Sorry about that.

Thank you for the answer, Tanja! I did the downgrade and now I am using the same entities with no problems.

Cheers