Hi guys. I’ve been working on a Portuguese bot with Rasa and getting quite good results so far, using spaCy components and a few other things. I’ve been searching for the right approach to dealing with accents: in my case I’d like to strip them, so that someone who misspells the word “não” as its variation “nao” generates the same vector, especially because people tend to abbreviate and misspell a lot. (I’ve already looked into synonyms, which will be very useful for some cases, but I didn’t want to list every accent-stripped variation as a synonym.)
Any ideas?
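What I had in mind is normalizing the text before featurization, something like this (a minimal sketch using only the standard library; `strip_accents` is just my name for the helper):

```python
import unicodedata

def strip_accents(text):
    # NFKD splits each accented character into its base character
    # plus combining marks; dropping the marks leaves the plain text.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert strip_accents("não") == "nao"
assert strip_accents("acho que não") == "acho que nao"
```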
My config.yml file looks like this, and it produces different NLU scores for similar phrases that differ only in accents:
```yaml
language: pt
pipeline:
- name: SpacyNLP
- name: SpacyTokenizer
- name: SpacyFeaturizer
  return_sequence: true
- name: RegexFeaturizer
  return_sequence: true
- name: CRFEntityExtractor
  return_sequence: true
- name: EntitySynonymMapper
- name: SklearnIntentClassifier
policies:
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.3
  core_threshold: 0.3
  fallback_core_action_name: action_default_fallback
  fallback_nlu_action_name: action_human_handoff
  deny_suggestion_intent_name: out_of_scope
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
- name: FormPolicy
```
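One idea I’m considering is wiring that normalization in as a custom component placed before the tokenizer, roughly like this (a rough sketch against the Rasa 1.x custom-component interface, which is what this pipeline runs on; `AccentStripper` and the `accent_stripper` module are placeholder names of mine, not an existing Rasa component):

```python
# accent_stripper.py -- placeholder module name, not part of Rasa.
import unicodedata

from rasa.nlu.components import Component


def strip_accents(text):
    # NFKD decomposition, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class AccentStripper(Component):
    """Strips accents so that 'não' and 'nao' featurize identically."""

    name = "accent_stripper"
    provides = []
    requires = []
    defaults = {}
    language_list = ["pt"]

    def train(self, training_data, cfg, **kwargs):
        # Normalize the training examples too, so training and
        # inference see exactly the same text.
        for example in training_data.training_examples:
            example.text = strip_accents(example.text)

    def process(self, message, **kwargs):
        # Normalize each incoming message before tokenization.
        message.text = strip_accents(message.text)
```

It would then be referenced in the pipeline by module path, e.g. `- name: "accent_stripper.AccentStripper"` right after SpacyNLP, so the tokenizer and featurizers only ever see accent-free text. I’m not sure this is the intended approach, though.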
For example, with this nlu.md:
```md
## intent:deny
- não
- nao
- nunca
- acho que não
```
I then get different NLU scores for the messages “acho que não” and “acho que nao”.