Dealing with accents with Rasa and spaCy components

Hi guys. I’ve been working on a Portuguese bot with Rasa and getting quite good results so far, using spaCy components and a few other things. I’ve been searching for the right approach to deal with accents: in my case I would like to strip them, so that people who misspell the word “não” (Portuguese) as its variation “nao” will generate the same vector, especially because people tend to abbreviate and misspell a lot. (I’ve already looked into synonyms, which will be very useful for some cases, but I didn’t want to declare every stripped variation as a synonym.)

Any ideas?

My config.yml file looks like this; with it, similar phrases get different NLU scores just because of accents:

language: pt
pipeline:
- name: SpacyNLP
- name: SpacyTokenizer
- name: SpacyFeaturizer
  return_sequence: true
- name: RegexFeaturizer
  return_sequence: true
- name: CRFEntityExtractor
  return_sequence: true
- name: EntitySynonymMapper
- name: SklearnIntentClassifier
policies:
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.3
  core_threshold: 0.3
  fallback_core_action_name: action_default_fallback
  fallback_nlu_action_name: action_human_handoff 
  deny_suggestion_intent_name: out_of_scope
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
- name: FormPolicy

For example, take this nlu.md training data:

## intent:deny
- não
- nao
- nunca
- acho que não

With this data, the messages “acho que não” and “acho que nao” still get different NLU scores.

Since you’re using spaCy, you’re relying on spaCy’s internal handling of accents, and its pretrained vectors treat “não” and “nao” as different tokens.
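
You can check this directly with the Portuguese word vectors. A minimal sketch, assuming the pt_core_news_md model (the medium model, which ships with real word vectors) is installed:

import numpy as np
import spacy

# Assumption: medium Portuguese model with word vectors,
# installed via: python -m spacy download pt_core_news_md
nlp = spacy.load("pt_core_news_md")

v1 = nlp("acho que não").vector
v2 = nlp("acho que nao").vector

# Cosine similarity below 1.0 means the featurizer receives
# two different inputs for the accented and stripped texts.
print(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))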

You can create a custom component that preprocesses the text the way you want; see the sketch below.
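
To make that concrete, here is a minimal sketch of such a component for Rasa 1.x. The class name AccentStripper and the module path custom_components are assumptions, not anything Rasa ships; the component strips diacritics from both training examples and incoming messages before tokenization:

import unicodedata

from rasa.nlu.components import Component

class AccentStripper(Component):
    """Strips diacritics so "não" and "nao" yield the same features."""

    # Hypothetical custom component; restricted to Portuguese here.
    language_list = ["pt"]

    @staticmethod
    def _strip_accents(text):
        # NFD-decompose the text, then drop the combining marks (accents).
        return "".join(
            c for c in unicodedata.normalize("NFD", text)
            if unicodedata.category(c) != "Mn"
        )

    def train(self, training_data, cfg, **kwargs):
        # Normalize the training examples the same way as live messages.
        for example in training_data.training_examples:
            example.text = self._strip_accents(example.text)

    def process(self, message, **kwargs):
        message.text = self._strip_accents(message.text)

You would then reference it by module path at the top of the pipeline, before SpacyNLP, so every later component sees the stripped text:

pipeline:
- name: custom_components.AccentStripper
- name: SpacyNLP
- name: SpacyTokenizer
...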

Thanks, I’ll give it a try :slight_smile: