Dealing with non-ASCII characters

It seems that some of our NLU components fail in the presence of non-ASCII characters. This thread grew out of the issue described here, but the topic deserves a larger discussion, so let’s use it to dig into the details.

Thanks Vincent!

As a motivating example, it’s been suggested that the following training data breaks in Rasa if you use the CRFEntityExtractor.

- intent: simulação
  examples: |
    - Posso simular o pedido de [pagamento em prestações](tipo_pagamento) de uma [divida](PEF) no Portal das Finanças?
    - Simular [IRS](imposto) em prestações
    - simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"}
    - Ja acedi a simulaçao 5 meses o valor é de 688.70 como posso finalizar o pedido
    - simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de [IRS](imposto)
    - como fazer simulação de prestações de [IRS](imposto)
    - onde posso obter simulação para [pagamento prestacional]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de 39000€ em 36 meses
    - Gostaria de fazer simulação para [dividir em prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} meu [IRS](imposto)
    - quero simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - como faço para simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - necessito de ajuda para simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - Em quantas [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} posso pagar uma [divida fiscal](PEF)?
    - como conseguir uma simulação de [avaliação](tipo_avaliação) de imóvel
    - como conseguir uma simulação de [avaliação](tipo_avaliação) de uma casa
    - como conseguir uma simulação de [IMI](imposto)
    - como conseguir uma simulação de [IRS](imposto)
    - como conseguir uma simulação de um [plano de prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"}
    - como consigo uma simulação de [avaliação](tipo_avaliação) de uma casa

To quote @nonola:

As you can see, I have some entity names like “tipo_avaliação”, “tipo_imóvel” or “óbito” which contain non-ASCII characters.

@nonola just to confirm, if you were to rewrite the text so that it does not include characters like ç or ã … would that suffice? One approach that might work here is to create an NLU component that takes care of this before the text is tokenized. Also to confirm, this wasn’t an issue with DIET? I understand DIET isn’t feasible now due to TensorFlow 2.6 performance, but it would be good to confirm.
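To make the idea concrete, here is a minimal sketch of the text transformation such a pre-tokenization step could apply. It uses only Python's standard `unicodedata` module; the function name `strip_accents` is hypothetical, and wiring it into a Rasa pipeline as a custom component is not shown here.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks removed (e.g. 'ç' becomes 'c')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the output is in NFC again.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("simulação de prestações"))
```

Symbols without a decomposition, such as €, pass through untouched. This is only reasonable for Latin-script text: in scripts where combining signs carry meaning, removing them would damage the words.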

With DIET it works 100%.

I believe it would work, but maybe only as a temporary solution, because it would amount to “writing with typos”. Not very practical.

They may be typos, but it would be up to the machine learning system to learn to deal with those.

That said, isn’t it common for users to not type the accents on the characters? I could imagine that many mobile phone users skip them because they take an extra step on the keyboard. Feel free to correct me if I am mistaken though, since I only speak Dutch and English.

OK, I understand, but if I use it like this:

[imóvel](tipo_imovel)

There won’t be a problem, right? I mean, if I take the ´ off the entity name (tipo_imovel) and keep it in [imóvel]?

The only way to know for sure is to try, but please do. I’d love to get more feedback on this.

If that doesn’t work, I’ll need to dive in a bit deeper myself into the codebase because I may need to start a GitHub issue for this.
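For anyone who wants to check a larger file for this, the sketch below lists entity names that still contain non-ASCII characters. It is a rough approximation, not Rasa's own parser: the regular expression only covers the `[text](entity)` and `[text]{"entity": ...}` forms that appear in this thread, and the function name `non_ascii_labels` is made up for illustration.

```python
import json
import re

# Matches [text](label) and [text]{"entity": "label", ...} annotations.
ANNOTATION = re.compile(
    r"\[(?P<text>[^\]]+)\](?:\((?P<label>[^)]+)\)|(?P<meta>\{[^}]*\}))"
)


def non_ascii_labels(example: str) -> list:
    """Return entity labels in one training example that are not pure ASCII."""
    labels = []
    for match in ANNOTATION.finditer(example):
        label = match.group("label")
        if label is None:
            # Long form: the label is stored under the "entity" key of the JSON object.
            label = json.loads(match.group("meta")).get("entity", "")
        if not label.isascii():
            labels.append(label)
    return labels


print(non_ascii_labels("como conseguir uma simulação de [avaliação](tipo_avaliação) de imóvel"))
print(non_ascii_labels("[imóvel](tipo_imovel)"))
```

The annotated text itself is deliberately ignored, so `[imóvel](tipo_imovel)` produces no findings.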

Ok. I’ll give it a try! Thanks Vincent!

Hi Vincent!

I removed all the non-ASCII characters (“ç”, “í”, “ó”, …) from the entity names, but this error keeps appearing:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-10: ordinal not in range(128)

I really can’t discover why.

Can you help me?

Thanks!

This may sound strange, but could you send me the smallest file that’s causing the error? This may be one of those moments where the operating system accidentally added a dangerous character to the file. Could you send it as a file attachment instead of a code snippet? Sometimes the Discourse forum fixes some of the characters before storing them.

An anecdote: I knew a team that spent two weeks looking for the reason their pipeline broke. The culprit turned out to be a .tsv file that wasn’t separated by tabs but by the Icelandic thorn. I’m wondering if something similar may be happening here.
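A quick way to hunt for that kind of stray character is to list every non-ASCII character in a file along with its position and Unicode name. The sketch below assumes the text has already been read into a string; the helper name `report_non_ascii` and the sample data are illustrative only.

```python
import unicodedata


def report_non_ascii(text: str) -> list:
    """List (line, column, character, Unicode name) for each non-ASCII character."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Fall back to a placeholder for characters without a name.
                name = unicodedata.name(ch, "UNKNOWN")
                findings.append((line_no, col_no, ch, name))
    return findings


# A thorn (U+00FE) hiding where a tab was expected, plus an ordinary accent.
sample = "col_a\u00fecol_b\nvalor \u00e9 688.70"
for finding in report_non_ascii(sample):
    print(finding)
```

When applying this to a real file, reading it with an explicit `encoding="utf-8"` argument to `open` avoids depending on the platform's default encoding, which is one possible source of 'ascii' codec errors.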

Hi Vincent!

You mean the nlu.yml, right?

Give me a couple of minutes, please.

Thanks

Here you go, Vincent. Thanks!

I have the file locally. Just to check, though: is this file allowed to be public? I’m asking because 1.) this forum is public, and 2.) I’d like to share this dataset with our research team, if you don’t mind.

Hi Vincent.

The file contains no sensitive or confidential data. You can share it with your team.

Thanks!

Could you also share the config.yml file that’s associated with the error?

Are you also getting these warnings during training? I’m seeing them with the default DIET pipeline.

/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Para rendimentos obtidos no estrangeiro que me foram pagos com moeda diferente do € como devo declarar no anexo J?' with intent 'declarar'. Make sure the start and end values of entities ([(5, 39, 'rendimentos estrangeiro'), (69, 78, 'diferente'), (82, 83, '€'), (106, 113, 'anexo J')]) in the training data match the token boundaries ([(0, 4, 'Para'), (5, 16, 'rendimentos'), (17, 24, 'obtidos'), (25, 27, 'no'), (28, 39, 'estrangeiro'), (40, 43, 'que'), (44, 46, 'me'), (47, 52, 'foram'), (53, 58, 'pagos'), (59, 62, 'com'), (63, 68, 'moeda'), (69, 78, 'diferente'), (79, 81, 'do'), (84, 88, 'como'), (89, 93, 'devo'), (94, 102, 'declarar'), (103, 105, 'no'), (106, 111, 'anexo'), (112, 113, 'J')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'recebi rendimentos do estrangeiro pagos em moeda diferente Euro €. como faço a conversão?' with intent 'declarar'. Make sure the start and end values of entities ([(7, 33, 'rendimentos estrangeiro'), (43, 65, 'moeda diferente Euro €'), (79, 88, 'conversão')]) in the training data match the token boundaries ([(0, 6, 'recebi'), (7, 18, 'rendimentos'), (19, 21, 'do'), (22, 33, 'estrangeiro'), (34, 39, 'pagos'), (40, 42, 'em'), (43, 48, 'moeda'), (49, 58, 'diferente'), (59, 63, 'Euro'), (67, 71, 'como'), (72, 76, 'faço'), (77, 78, 'a'), (79, 88, 'conversão')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Tenho faturas da farmácia com taxa de IVA a 23% entram para o IRS?' with intent 'declarar'. Make sure the start and end values of entities ([(6, 25, 'despesas de saúde'), (38, 47, 'IVA a 23%'), (62, 65, 'IRS')]) in the training data match the token boundaries ([(0, 5, 'Tenho'), (6, 13, 'faturas'), (14, 16, 'da'), (17, 25, 'farmácia'), (26, 29, 'com'), (30, 34, 'taxa'), (35, 37, 'de'), (38, 41, 'IVA'), (42, 43, 'a'), (44, 46, '23'), (48, 54, 'entram'), (55, 59, 'para'), (60, 61, 'o'), (62, 65, 'IRS')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Se arrendar uma casa e uma garagem tenho de fazer duas comunicações diferentes?' with intent 'arrendamento'. Make sure the start and end values of entities ([(3, 20, 'arrendamento urbano'), (68, 77, 'diferente')]) in the training data match the token boundaries ([(0, 2, 'Se'), (3, 11, 'arrendar'), (12, 15, 'uma'), (16, 20, 'casa'), (21, 22, 'e'), (23, 26, 'uma'), (27, 34, 'garagem'), (35, 40, 'tenho'), (41, 43, 'de'), (44, 49, 'fazer'), (50, 54, 'duas'), (55, 67, 'comunicações'), (68, 78, 'diferentes')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Se o imóvel pertencer a vários vendedores quantas (declaracoes) tenho que fazer?' with intent 'IMI'. Make sure the start and end values of entities ([(5, 11, 'imóvel'), (24, 41, 'compropriedade'), (50, 63, '(declaracoes)')]) in the training data match the token boundaries ([(0, 2, 'Se'), (3, 4, 'o'), (5, 11, 'imóvel'), (12, 21, 'pertencer'), (22, 23, 'a'), (24, 30, 'vários'), (31, 41, 'vendedores'), (42, 49, 'quantas'), (51, 62, 'declaracoes'), (64, 69, 'tenho'), (70, 73, 'que'), (74, 79, 'fazer')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Preciso obter o meu registo de contribuinte mas não sei como procedervia [online](portal). É para fins de abertura de conta bancária' with intent 'NIF'. Make sure the start and end values of entities ([(20, 43, 'registo de contribuinte'), (48, 55, 'não sei'), (69, 80, 'via [online')]) in the training data match the token boundaries ([(0, 7, 'Preciso'), (8, 13, 'obter'), (14, 15, 'o'), (16, 19, 'meu'), (20, 27, 'registo'), (28, 30, 'de'), (31, 43, 'contribuinte'), (44, 47, 'mas'), (48, 51, 'não'), (52, 55, 'sei'), (56, 60, 'como'), (61, 72, 'procedervia'), (74, 88, 'online](portal'), (91, 92, 'É'), (93, 97, 'para'), (98, 102, 'fins'), (103, 105, 'de'), (106, 114, 'abertura'), (115, 117, 'de'), (118, 123, 'conta'), (124, 132, 'bancária')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
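The condition these warnings describe can be reproduced outside Rasa with a small sketch: an entity is aligned only if its start offset is the start of some token and its end offset is the end of some token. The `\w+` pattern below is an assumption that merely approximates the token boundaries printed in the warnings; it is not Rasa's WhitespaceTokenizer, and both function names are hypothetical.

```python
import re


def word_tokens(text: str) -> list:
    """Approximate tokenization: runs of word characters with their offsets."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", text)]


def misaligned_entities(text: str, entities: list) -> list:
    """Return the (start, end, value) entities whose edges are not token edges."""
    tokens = word_tokens(text)
    starts = {start for start, _, _ in tokens}
    ends = {end for _, end, _ in tokens}
    return [e for e in entities if e[0] not in starts or e[1] not in ends]


message = "Se arrendar uma casa e uma garagem tenho de fazer duas comunicações diferentes?"
entities = [(3, 20, "arrendamento urbano"), (68, 77, "diferente")]
print(misaligned_entities(message, entities))
```

In this message the second annotation covers only part of the token 'diferentes' (68 to 78), which is what triggers the warning; the accented characters in 'comunicações' are not involved.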

Hi Vincent,

Yes, I always get those errors with DIET.

Here is my config file:

language: pt
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: CRFEntityExtractor
- name: DIETClassifier
  epochs: 30
  learning_rate: 0.005
  constrain_similarities: true
  entity_recognition: False  
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.5
  ambiguity_threshold: 0.1
policies:
- name: MemoizationPolicy
- name: RulePolicy
- name: UnexpecTEDIntentPolicy
  max_history: 5
  epochs: 100
- name: TEDPolicy
  max_history: 5
  epochs: 100
  constrain_similarities: true

Hi, I’ve been trying to train a Rasa bot in the Sinhala language. The text contains invisible Unicode characters such as the zero-width joiner. I have a couple of entities I want to recognize. In the NLU file I have:

- intent: most_popular_song_of_artist
  examples: |
      - [āˇāˇ’āˇ„āˇāļąāˇŠ āļ¸āˇ’⎄⎒āļģāļ‚āļœ](artist)āļœāˇš āļĸāļąāļ´āˇŠâ€āļģ⎒āļēāļ¸ āˇƒāˇ’āļąāˇŠāļ¯āˇ”⎀ āļ¸āˇœāļšāļšāˇŠāļ¯?
      - [āļ…āļ­āˇ”āļŊ āļ…āļ°āˇ’āļšāˇāļģ⎓](artist)āļœāˇ™  āļĸāļąāļ´āˇŠâ€āļģ⎒āļēāļ¸ āļœāˇ“āļ­āļē āļ¸āˇœāļšāļšāˇŠāļ¯?
      - [⎆āļąāˇŠāļšāˇ’ āļŠāļģ⎊āļ§āˇŠ](artist)āļœāˇš āļĸāļąāļ´āˇŠâ€āļģ⎒āļēāļ¸ āˇƒāˇ’āļ‚āļ¯āˇ”⎀ āļ¸āˇœāļšāļ¯āˇŠāļ¯ āļšāˇ”āļ¸āļšāˇŠāļ¯ āļšāˇ’āļēāļŊāˇ āļšāˇ’āļēāļąāˇ€āļ¯?
      - [āļąāļąāˇŠāļ¯āˇ āļ¸āˇāļŊāļąāˇ“](artist)āļœāˇš āļ´āˇŠâ€āļģāˇƒāˇ’āļ¯āˇŠāļ°āļ¸ āļœāˇ“āļ­āļē āļšāˇ”āļ¸āļšāˇŠāļ¯?

- lookup: artist
  examples: |
      - āˇāˇ’āˇ„āˇāļąāˇŠ āļ¸āˇ’⎄⎒āļģāļ‚āļœ
      - āļ¸āˇ’āļŊ⎊āļ§āļąāˇŠ āļ¸āļŊ⎊āļŊāˇ€āˇāļģāļ āˇŠāļ āˇ’
      - āˇƒāˇ”āļģ⎚āļąāˇŠāļ¯āˇŠâ€āļģ  āļ´āˇ™āļģ⎚āļģāˇ
      - āļ…āļ­āˇ”āļŊ āļ…āļ°āˇ’āļšāˇāļģ⎓

While training, Rasa warns me:

UserWarning: Misaligned entity annotation in message 'āˇāˇ’āˇ„āˇāļąāˇŠ āļ¸āˇ’⎄⎒āļģāļ‚āļœāļœāˇš āļĸāļąāļ´āˇŠâ€āļģ⎒āļēāļ¸ āˇƒāˇ’āļąāˇŠāļ¯āˇ”⎀ āļ¸āˇœāļšāļšāˇŠāļ¯?' with intent 'most_popular_song_of_artist'. Make sure the start and end values of entities ([(0, 14, 'āˇāˇ’āˇ„āˇāļąāˇŠ āļ¸āˇ’⎄⎒āļģāļ‚āļœ')]) in the training data match the token boundaries ([(0, 6, 'āˇāˇ’āˇ„āˇāļąāˇŠ'), (7, 16, 'āļ¸āˇ’⎄⎒āļģāļ‚āļœāļœāˇš'), (17, 26, 'āļĸāļąāļ´āˇŠ\u200dāļģ⎒āļēāļ¸'), (27, 34, 'āˇƒāˇ’āļąāˇŠāļ¯āˇ”⎀'), (35, 41, 'āļ¸āˇœāļšāļšāˇŠāļ¯')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data

I tried changing the above to the following (adding start and end):

- intent: most_popular_song_of_artist
  examples: |
    - [āˇāˇ’āˇ„āˇāļąāˇŠ āļ¸āˇ’⎄⎒āļģāļ‚āļœ]{"entity":"artist","start":0,"end":11}āļœāˇš āļĸāļąāļ´āˇŠâ€āļģ⎒āļēāļ¸ āˇƒāˇ’āļąāˇŠāļ¯āˇ”⎀ āļ¸āˇœāļšāļšāˇŠāļ¯?
    - [āļ…āļ­āˇ”āļŊ āļ…āļ°āˇ’āļšāˇāļģ⎓]{"entity":"artist","start":0,"end":11}āļœāˇ™  āļĸāļąāļ´āˇŠâ€āļģ⎒āļēāļ¸ āļœāˇ“āļ­āļē āļ¸āˇœāļšāļšāˇŠāļ¯?
 
    

But the DIET classifier still returns zero entities. I think this happens because of issues with whitespace. How can I solve this?

Why don’t you have a whitespace between the ) and the next letter?
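The effect behind that question can be sketched with plain whitespace splitting. The example uses a Latin-script stand-in for a name followed directly by a suffix, so the strings are hypothetical and not taken from the Sinhala data; the helper names are also made up. It additionally shows that the zero-width joiner (U+200D) is a format character rather than whitespace, so it never splits a token but still counts toward character offsets.

```python
import re
import unicodedata


def whitespace_tokens(text: str) -> list:
    """Split on whitespace only and keep character offsets."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def format_characters(text: str) -> list:
    """Return (offset, Unicode name) for invisible format characters (category Cf)."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical stand-in: the entity "Nanda Malini" with the suffix "ge" glued on.
glued = "Nanda Malinige song"
spaced = "Nanda Malini ge song"
entity_end = len("Nanda Malini")

# Is the entity's end offset also the end of some token?
print(entity_end in {end for _, end, _ in whitespace_tokens(glued)})
print(entity_end in {end for _, end, _ in whitespace_tokens(spaced)})

# The zero-width joiner is reported as a format character, not as whitespace.
print(format_characters("ab\u200dcd"), "\u200d".isspace())
```

With the suffix glued on, the entity ends in the middle of a token, which matches the shape of the warning above. Whether inserting a space is acceptable for the language in question is a separate decision.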
