Hey all,
I was using Rasa 1.9.0 and needed to load a custom spaCy language model that I had modified so that tokenization behaves the way I need.
Basically, I need the ‘/’ character to be treated as a separate token in some cases. For example, with the default spaCy behavior for my language (pt-br), the sentence ‘I bought it 05/20’ is tokenized as: ‘I’, ‘bought’, ‘it’, ‘05/20’.
I adapted the language package and saved it so that the sentence is tokenized as: ‘I’, ‘bought’, ‘it’, ‘05’, ‘/’, ‘20’.
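For anyone wanting to reproduce this kind of tokenization, here is a minimal sketch of one way to get it. It assumes a blank `pt` pipeline and one extra infix rule that splits ‘/’ only between digits; both the rule and the `build_nlp` helper name are illustrative assumptions, and the actual package may have been built differently.

```python
import spacy
from spacy.util import compile_infix_regex


def build_nlp():
    """Return a blank Portuguese pipeline whose tokenizer splits '/' between digits.

    Hypothetical helper for illustration; not the original custom package.
    """
    nlp = spacy.blank("pt")
    # Keep the default infix rules and add one for a slash surrounded by digits.
    infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])/(?=[0-9])"]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    return nlp


nlp = build_nlp()
tokens = [t.text for t in nlp("I bought it 05/20")]
print(tokens)
```

Saving the pipeline (for example with `nlp.to_disk(...)`) and making it loadable under the name used in config.yml are left out of the sketch.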
This second behavior is crucial for my annotation scheme. I got it working on Rasa 1.9.0 with the following config.yml:
language: "pt"
pipeline:
  - name: "SpacyNLP"
    model: "my_personal_lang_model"
  - name: "SpacyTokenizer"
  - name: LexicalSyntacticFeaturizer
    features: [
      ["low", "title", "upper", "pos", "pos2", "digit"],
      [
        "low",
        "prefix5",
        "prefix2",
        "suffix5",
        "suffix3",
        "suffix2",
        "upper",
        "title",
        "digit",
        "pos",
        "pos2"
      ],
      ["low", "title", "upper", "pos", "pos2", "digit"],
    ]
  - name: "CountVectorsFeaturizer"
    min_ngram: 1
    max_ngram: 2
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "SpacyFeaturizer"
  - name: "DIETClassifier"
    epochs: 300
    random_seed: 42
    embedding_dimension: 30
  - name: "EntitySynonymMapper"
I needed this custom spaCy model because training produced a warning about misaligned entity values: the ‘/’ character was not being split off, so I could not annotate two entities as different parts of a single token. As a result, the model could not infer the entities MM (month) and DD (day) correctly when I tested with ‘rasa shell nlu’. The custom spaCy model together with this config solved the problem, and the NLU model then inferred MM/DD date patterns well.
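To make the misalignment concrete, here is a small standalone sketch. It is not Rasa's actual check; it assumes a simple regex tokenizer standing in for spaCy, and the `tokenize` and `misaligned` helpers and the entity dictionaries are illustrative names only. It flags an entity when its start or end does not fall on a token boundary.

```python
import re


def tokenize(text, split_slash):
    """Return (start, end, token) triples; optionally emit '/' as its own token."""
    pattern = r"[^\s/]+|/" if split_slash else r"\S+"
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


def misaligned(tokens, entities):
    """Return the entities whose boundaries do not coincide with token boundaries."""
    starts = {start for start, _, _ in tokens}
    ends = {end for _, end, _ in tokens}
    return [e for e in entities if e["start"] not in starts or e["end"] not in ends]


text = "I bought it 05/20"
entities = [
    {"start": 12, "end": 14, "entity": "MM", "value": "05"},
    {"start": 15, "end": 17, "entity": "DD", "value": "20"},
]

# '05/20' kept as one token: both entities sit inside a single token.
print(misaligned(tokenize(text, split_slash=False), entities))
# '/' split off: every entity boundary matches a token boundary.
print(misaligned(tokenize(text, split_slash=True), entities))
```

Running the same kind of comparison on the tokens produced at inference time would show whether the ‘/’ split is still reaching the entity extractor.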
After updating Rasa, I retrained my model with the same config as above, and I have checked that my custom spaCy model still tokenizes text as expected. No warning is shown during training in the new version, but at inference time in ‘rasa shell nlu’ my model fails to recognize these entities.
What changed? The tokenizer component still seems to separate the ‘/’ token, but it looks like another component treats this character differently in the current Rasa version.
I looked at the changelog, but I still have difficulty understanding why I am having this issue.
I saw that there is a new note in the training data format section, but I am not sure where or how it impacts my process:
Note: ‘/’ symbol is reserved as a delimiter to separate retrieval intents from response text identifiers. Make sure not to use it in the name of your intents.
Any help on this would be great! Is there anything I can do before considering downgrading Rasa to a previous version?