Dealing with Non-ascii characters

koaning · November 5, 2021, 10:06am

It seems that some of our NLU components fail in the presence of non-ASCII characters. This thread was started because of the issue described here, but the topic deserves a larger discussion, so let’s use this thread to dig into it in more detail.

nonola · November 5, 2021, 10:09am

Thanks Vincent!

koaning · November 5, 2021, 10:09am

As a motivating example, it’s been suggested that the training data below breaks in Rasa if you use the CRFEntityExtractor.

- intent: simulação
  examples: |
    - Posso simular o pedido de [pagamento em prestações](tipo_pagamento) de uma [divida](PEF) no Portal das Finanças?
    - Simular [IRS](imposto) em prestações
    - simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"}
    - Ja acedi a simulaçao 5 meses o valor é de 688.70 como posso finalizar o pedido
    - simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de [IRS](imposto)
    - como fazer simulação de prestações de [IRS](imposto)
    - onde posso obter simulação para [pagamento prestacional]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de 39000€ em 36 meses
    - Gostaria de fazer simulação para [dividir em prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} meu [IRS](imposto)
    - quero simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - como faço para simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - necessito de ajuda para simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - Em quantas [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} posso pagar uma [divida fiscal](PEF)?
    - como conseguir uma simulação de [avaliação](tipo_avaliação) de imóvel
    - como conseguir uma simulação de [avaliação](tipo_avaliação) de uma casa
    - como conseguir uma simulação de [IMI](imposto)
    - como conseguir uma simulação de [IRS](imposto)
    - como conseguir uma simulação de um [plano de prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"}
    - como consigo uma simulação de [avaliação](tipo_avaliação) de uma casa

To quote @nonola:

As you can see, I’ve some entity names like “tipo_avaliação”, “tipo_imóvel” or “óbito” which contains non-ascii char.

@nonola just to confirm, if you were to rewrite the text such that it does not include characters like ç or ã … would that suffice? One approach that might work here is to create an NLU component that takes care of this before the text is tokenized. Also to confirm, this wasn’t an issue with DIET? I understand DIET isn’t feasible now due to TensorFlow 2.6 performance, but it would be good to confirm.
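A minimal sketch of what such a preprocessing step could do, assuming plain Python rather than an actual Rasa component: the hypothetical `strip_accents` helper below decomposes each character and drops the combining marks, so ç becomes c and ã becomes a. How this would be wired into a pipeline ahead of the tokenizer is not shown here.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("como fazer simulação de prestações"))
```

Symbols without a decomposition, such as €, pass through unchanged, so they would still need separate handling if they turn out to be a problem.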

nonola · November 5, 2021, 10:10am

With DIET it works 100%.

nonola · November 5, 2021, 10:12am

I believe it would work, but maybe only as a temporary solution, because it would be “writing with typos”. Not very practical.

koaning · November 5, 2021, 10:24am

They may be typos, but it would be up to the machine learning system to learn to deal with those.

That said, isn’t it common for users to not type the accents on the characters? I could imagine that because it’s an extra step on a keyboard many mobile phone users may skip the effort. Feel free to correct me if I am mistaken though since I only speak Dutch and English.

nonola · November 5, 2021, 11:19am

Ok, I understand, but if I use it like this:

[imóvel](tipo_imovel)

There won’t be a problem, right? I mean, if I take the ´ off the entity name (tipo_imovel) and keep it in [imóvel]?

koaning · November 5, 2021, 11:54am

The only way to know for sure is to try, but please do. I’d love to get more feedback on this.

If that doesn’t work, I’ll need to dive in a bit deeper myself into the codebase because I may need to start a GitHub issue for this.
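To make that experiment easier to verify, here is a rough sketch that pulls the entity names out of annotated examples and lists the ones that still contain non-ASCII characters. It assumes only the two annotation shapes seen in this thread, `[text](entity)` and `[text]{"entity": ...}`; the regular expressions and function names are illustrative and not part of Rasa.

```python
import json
import re
from typing import Iterable, List

# [text](entity_name)
SHORT_FORM = re.compile(r"\[[^\]]+\]\(([^)]+)\)")
# [text]{"entity": "entity_name", ...}
LONG_FORM = re.compile(r"\[[^\]]+\](\{[^}]*\})")


def entity_names(example: str) -> List[str]:
    """Collect the entity names annotated in one training example."""
    names = [match.group(1) for match in SHORT_FORM.finditer(example)]
    for match in LONG_FORM.finditer(example):
        names.append(json.loads(match.group(1))["entity"])
    return names


def non_ascii_entity_names(examples: Iterable[str]) -> List[str]:
    """Return the sorted entity names that contain non-ASCII characters."""
    found = set()
    for example in examples:
        found.update(name for name in entity_names(example) if not name.isascii())
    return sorted(found)


examples = [
    "como conseguir uma simulação de [avaliação](tipo_avaliação) de imóvel",
    'simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de [IRS](imposto)',
]
print(non_ascii_entity_names(examples))
```

An empty list after the rename would suggest that the entity names themselves are no longer the source of the problem.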

nonola · November 5, 2021, 12:28pm

Ok. I’ll give it a try! Thanks Vincent!

nonola · November 5, 2021, 7:04pm

Hi Vincent!

I removed all the non-ASCII characters (“ç”, “í”, “ó”, …) from the entity names, but this error keeps appearing:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-10: ordinal not in range(128)

I really can’t figure out why.

Can you help me?

Thanks!
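For reference, this is the message Python produces whenever a string containing non-ASCII characters is encoded with the ASCII codec, which can happen implicitly when text is written to a file or terminal whose encoding resolves to ASCII. The sketch below reproduces the error in isolation and prints the encodings Python picked up from the environment. Whether that is what happens inside Rasa here is an assumption, not something the message above confirms, and `ascii_failure` is a hypothetical helper name.

```python
import locale
import sys
from typing import Optional


def ascii_failure(text: str) -> Optional[UnicodeEncodeError]:
    """Return the error raised when text is encoded as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err
    return None


# A word with two adjacent non-ASCII characters triggers the error.
print(ascii_failure("avaliação"))
# A pure-ASCII name does not.
print(ascii_failure("tipo_imovel"))

# Encodings that Python uses by default in this environment.
print("stdout encoding:", sys.stdout.encoding)
print("preferred encoding:", locale.getpreferredencoding(False))
```

If either of the two reported encodings is ASCII, the environment rather than the training data may be what needs fixing.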

koaning · November 8, 2021, 8:26am

This may sound strange, but could you send me the smallest file that’s causing the error? This may be one of those moments where the operating system accidentally added a dangerous character to the file. Could you send it as a file attachment instead of a code snippet? Sometimes the Discourse forum fixes some of the characters before storing it.

An anecdote: I knew a team that spent two weeks looking for the reason their pipeline broke. The culprit turned out to be a .tsv file that wasn’t separated by tabs but by the Icelandic thorn. I’m wondering if something similar may be happening here.
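One way to hunt for such a character, sketched under the assumption that the file is UTF-8 encoded: count every non-ASCII character and print its code point and Unicode name, so that anything unexpected stands out. The helper names and the `data/nlu.yml` path mentioned in the docstring are placeholders.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def non_ascii_summary(text: str) -> List[Tuple[str, str, int]]:
    """List (code point, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]


def summarize_file(path: str) -> List[Tuple[str, str, int]]:
    """Read a UTF-8 file (for example data/nlu.yml) and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return non_ascii_summary(handle.read())


# Demo on an in-memory string: two thorns and one euro sign.
for row in non_ascii_summary("a\u00feb\u00fec 39000\u20ac"):
    print(row)
```

Accented Portuguese letters are expected in the output; control characters, unusual separators, or symbols nobody remembers typing are the ones to look at.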

nonola · November 8, 2021, 8:44am

Hi Vincent!

You mean the nlu.yml, right?

Give me a couple of minutes, please.

Thanks

nonola · November 8, 2021, 12:36pm

Here you go, Vincent. Thanks!

koaning · November 10, 2021, 7:02am

I have the file locally. Just to check, though: is this file allowed to be public? I’m asking because 1) this forum is public, and 2) I’d like to share this dataset with our research team, if you don’t mind.

nonola · November 10, 2021, 8:01am

Hi Vincent.

The file contains no sensitive or confidential data. You can share it with your team.

Thanks!

koaning · November 10, 2021, 11:57am

Could you also share the config.yml file that’s associated with the error?

koaning · November 10, 2021, 12:05pm

Also, are you getting these warnings during training as well? I’m seeing these with the default DIET pipeline.

/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Para rendimentos obtidos no estrangeiro que me foram pagos com moeda diferente do € como devo declarar no anexo J?' with intent 'declarar'. Make sure the start and end values of entities ([(5, 39, 'rendimentos estrangeiro'), (69, 78, 'diferente'), (82, 83, '€'), (106, 113, 'anexo J')]) in the training data match the token boundaries ([(0, 4, 'Para'), (5, 16, 'rendimentos'), (17, 24, 'obtidos'), (25, 27, 'no'), (28, 39, 'estrangeiro'), (40, 43, 'que'), (44, 46, 'me'), (47, 52, 'foram'), (53, 58, 'pagos'), (59, 62, 'com'), (63, 68, 'moeda'), (69, 78, 'diferente'), (79, 81, 'do'), (84, 88, 'como'), (89, 93, 'devo'), (94, 102, 'declarar'), (103, 105, 'no'), (106, 111, 'anexo'), (112, 113, 'J')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'recebi rendimentos do estrangeiro pagos em moeda diferente Euro €. como faço a conversão?' with intent 'declarar'. Make sure the start and end values of entities ([(7, 33, 'rendimentos estrangeiro'), (43, 65, 'moeda diferente Euro €'), (79, 88, 'conversão')]) in the training data match the token boundaries ([(0, 6, 'recebi'), (7, 18, 'rendimentos'), (19, 21, 'do'), (22, 33, 'estrangeiro'), (34, 39, 'pagos'), (40, 42, 'em'), (43, 48, 'moeda'), (49, 58, 'diferente'), (59, 63, 'Euro'), (67, 71, 'como'), (72, 76, 'faço'), (77, 78, 'a'), (79, 88, 'conversão')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Tenho faturas da farmácia com taxa de IVA a 23% entram para o IRS?' with intent 'declarar'. Make sure the start and end values of entities ([(6, 25, 'despesas de saúde'), (38, 47, 'IVA a 23%'), (62, 65, 'IRS')]) in the training data match the token boundaries ([(0, 5, 'Tenho'), (6, 13, 'faturas'), (14, 16, 'da'), (17, 25, 'farmácia'), (26, 29, 'com'), (30, 34, 'taxa'), (35, 37, 'de'), (38, 41, 'IVA'), (42, 43, 'a'), (44, 46, '23'), (48, 54, 'entram'), (55, 59, 'para'), (60, 61, 'o'), (62, 65, 'IRS')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Se arrendar uma casa e uma garagem tenho de fazer duas comunicações diferentes?' with intent 'arrendamento'. Make sure the start and end values of entities ([(3, 20, 'arrendamento urbano'), (68, 77, 'diferente')]) in the training data match the token boundaries ([(0, 2, 'Se'), (3, 11, 'arrendar'), (12, 15, 'uma'), (16, 20, 'casa'), (21, 22, 'e'), (23, 26, 'uma'), (27, 34, 'garagem'), (35, 40, 'tenho'), (41, 43, 'de'), (44, 49, 'fazer'), (50, 54, 'duas'), (55, 67, 'comunicações'), (68, 78, 'diferentes')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Se o imóvel pertencer a vários vendedores quantas (declaracoes) tenho que fazer?' with intent 'IMI'. Make sure the start and end values of entities ([(5, 11, 'imóvel'), (24, 41, 'compropriedade'), (50, 63, '(declaracoes)')]) in the training data match the token boundaries ([(0, 2, 'Se'), (3, 4, 'o'), (5, 11, 'imóvel'), (12, 21, 'pertencer'), (22, 23, 'a'), (24, 30, 'vários'), (31, 41, 'vendedores'), (42, 49, 'quantas'), (51, 62, 'declaracoes'), (64, 69, 'tenho'), (70, 73, 'que'), (74, 79, 'fazer')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Preciso obter o meu registo de contribuinte mas não sei como procedervia [online](portal). É para fins de abertura de conta bancária' with intent 'NIF'. Make sure the start and end values of entities ([(20, 43, 'registo de contribuinte'), (48, 55, 'não sei'), (69, 80, 'via [online')]) in the training data match the token boundaries ([(0, 7, 'Preciso'), (8, 13, 'obter'), (14, 15, 'o'), (16, 19, 'meu'), (20, 27, 'registo'), (28, 30, 'de'), (31, 43, 'contribuinte'), (44, 47, 'mas'), (48, 51, 'não'), (52, 55, 'sei'), (56, 60, 'como'), (61, 72, 'procedervia'), (74, 88, 'online](portal'), (91, 92, 'É'), (93, 97, 'para'), (98, 102, 'fins'), (103, 105, 'de'), (106, 114, 'abertura'), (115, 117, 'de'), (118, 123, 'conta'), (124, 132, 'bancária')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
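To see what the warning is checking, here is a simplified sketch: an entity counts as aligned only if its start coincides with the start of some token and its end with the end of some token. The `\w+` tokenizer below is a stand-in chosen for illustration, not Rasa’s actual tokenizer, although for this message it yields the same boundaries as the third warning above.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def token_spans(text: str) -> List[Span]:
    """Character offsets of each run of word characters (illustrative tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def misaligned_entities(text: str, entities: List[Span]) -> List[Span]:
    """Return entity spans whose start or end is not on a token boundary."""
    spans = token_spans(text)
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    return [(s, e) for s, e in entities if s not in starts or e not in ends]


text = "Tenho faturas da farmácia com taxa de IVA a 23% entram para o IRS?"
entities = [(6, 25), (38, 47), (62, 65)]  # offsets taken from the warning
print(misaligned_entities(text, entities))
```

Only the span (38, 47) is flagged here, because it ends on the % sign that the tokenizer dropped; ending that annotation at 46, or leaving the % out of it, would line it up with the token '23'.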

nonola · November 10, 2021, 3:50pm

Hi Vincent,

Yes, I always get those warnings with DIET.

Here is my config file:

language: pt
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: CRFEntityExtractor
- name: DIETClassifier
  epochs: 30
  learning_rate: 0.005
  constrain_similarities: true
  entity_recognition: False  
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.5
  ambiguity_threshold: 0.1
policies:
- name: MemoizationPolicy
- name: RulePolicy
- name: UnexpecTEDIntentPolicy
  max_history: 5
  epochs: 100
- name: TEDPolicy
  max_history: 5
  epochs: 100
  constrain_similarities: true

rumesh · November 10, 2021, 5:03pm

Hi, I’ve been trying to train a Rasa bot in the Sinhala language. The text contains Unicode whitespace-like characters such as the zero-width joiner. I have a couple of entities I want to recognize. In the NLU file I have:

- intent: most_popular_song_of_artist
  examples: |
    - [ශිහාන් මිහිරංග](artist)ගේ ජනප්‍රියම සින්දුව මොකක්ද?
    - [අතුල අධිකාරී](artist)ගෙ  ජනප්‍රියම ගීතය මොකක්ද?
    - [ෆන්කි ඩර්ට්](artist)ගේ ජනප්‍රියම සිංදුව මොකද්ද කුමක්ද කියලා කියනවද?
    - [නන්දා මාලනී](artist)ගේ ප්‍රසිද්ධම ගීතය කුමක්ද?

- lookup: artist
  examples: |
    - ශිහාන් මිහිරංග
    - මිල්ටන් මල්ලවාරච්චි
    - සුරේන්ද්‍ර  පෙරේරා
    - අතුල අධිකාරී

While training, Rasa warns me:

UserWarning: Misaligned entity annotation in message 'ශිහාන් මිහිරංගගේ ජනප්‍රියම සින්දුව මොකක්ද?' with intent 'most_popular_song_of_artist'. Make sure the start and end values of entities ([(0, 14, 'ශිහාන් මිහිරංග')]) in the training data match the token boundaries ([(0, 6, 'ශිහාන්'), (7, 16, 'මිහිරංගගේ'), (17, 26, 'ජනප්\u200dරියම'), (27, 34, 'සින්දුව'), (35, 41, 'මොකක්ද')]). Common causes: 
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data

I tried changing the above to the following (adding start and end):

- intent: most_popular_song_of_artist
  examples: |
    - [ශිහාන් මිහිරංග]{"entity":"artist","start":0,"end":11}ගේ ජනප්‍රියම සින්දුව මොකක්ද?
    - [අතුල අධිකාරී]{"entity":"artist","start":0,"end":11}ගෙ  ජනප්‍රියම ගීතය මොකක්ද?

But the DIET classifier still returns zero entities. I think this happens due to issues with whitespace. How can I solve this?

nonola · November 10, 2021, 5:08pm

Why don’t you have whitespace between the ) and the next letter?
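On the whitespace question, one detail can be checked directly in Python: the zero-width joiner (U+200D) is a format character, not whitespace, so whitespace-based splitting does not break a word at it. That matches the warning above, where the token containing \u200d is reported as a single token. The sketch below only illustrates this with Python’s built-in `str.split`; it makes no claim about how Rasa’s tokenizer is implemented.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner

print(unicodedata.name(ZWJ))
print(unicodedata.category(ZWJ))  # "Cf" is the Unicode "format" category
print(ZWJ.isspace())              # False: Python does not treat it as whitespace

# Splitting on whitespace keeps the joiner inside the first word.
sample = "ab" + ZWJ + "cd ef"
print(sample.split())
```

The misalignment that the warning does report is at the entity end (14), which falls inside the token spanning 7 to 16, because the suffix follows the closing parenthesis without a space.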
