It seems that some of our NLU components fail in the presence of non-ASCII characters. This thread was started because of the issue described here, but the topic deserves a larger discussion, so let's use this thread to discuss it in more detail.
Thanks Vincent!
As a motivating example, it’s been suggested that the following training data breaks in Rasa if you use the CRFEntityExtractor.
- intent: simulação
  examples: |
    - Posso simular o pedido de [pagamento em prestações](tipo_pagamento) de uma [divida](PEF) no Portal das Finanças?
    - Simular [IRS](imposto) em prestações
    - simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"}
    - Ja acedi a simulaçao 5 meses o valor é de 688.70 como posso finalizar o pedido
    - simular [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de [IRS](imposto)
    - como fazer simulação de prestações de [IRS](imposto)
    - onde posso obter simulação para [pagamento prestacional]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} de 39000€ em 36 meses
    - Gostaria de fazer simulação para [dividir em prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} meu [IRS](imposto)
    - quero simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - como faço para simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - necessito de ajuda para simular [pagamento a prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} [IRS](imposto)
    - Em quantas [prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"} posso pagar uma [divida fiscal](PEF)?
    - como conseguir uma simulação de [avaliação](tipo_avaliação) de imóvel
    - como conseguir uma simulação de [avaliação](tipo_avaliação) de uma casa
    - como conseguir uma simulação de [IMI](imposto)
    - como conseguir uma simulação de [IRS](imposto)
    - como conseguir uma simulação de um [plano de prestações]{"entity": "tipo_pagamento", "value": "pagamento em prestações"}
    - como consigo uma simulação de [avaliação](tipo_avaliação) de uma casa
To quote @nonola:
As you can see, I’ve some entity names like “tipo_avaliação”, “tipo_imóvel” or “óbito” which contain non-ASCII characters.
@nonola just to confirm, if you were to convert the text such that it does not include characters like ç or ã … would that suffice? One approach that might work here is to create an NLU component that takes care of this before the text is tokenized. Also to confirm, this wasn’t an issue with DIET? I understand DIET isn’t feasible now due to TensorFlow 2.6 performance, but it would be good to confirm.
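The pre-processing idea above could look roughly like the sketch below. It is plain Python rather than a Rasa component: `strip_accents` is a hypothetical helper name, and how it would be wired into a pipeline before the tokenizer is left open, since that depends on the Rasa version.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper, not part of Rasa: 'ç' decomposes into 'c' plus a
    combining cedilla, and the combining mark is then filtered out.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("simulação de prestações"))  # simulacao de prestacoes
print(strip_accents("tipo_imóvel"))              # tipo_imovel
```

For precomposed input like the examples here, each accented letter maps to exactly one unaccented letter, so character offsets of entity annotations stay the same. Symbols such as € are not accents and pass through unchanged.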
With DIET it works 100%.
I believe it would work, but maybe just as a temporary solution, because it would be “writing with typos”. Not very practical.
They may be typos, but it would be up to the machine learning system to learn to deal with those.
That said, isn’t it common for users to not type the accents on the characters? I could imagine that, because it’s an extra step on a keyboard, many mobile phone users skip the effort. Feel free to correct me if I am mistaken though, since I only speak Dutch and English.
OK, I understand, but if I use it like this:
[imóvel](tipo_imovel)
There won’t be a problem, right? I mean, if I take off the ´ from the entity name (tipo_imovel) and keep it in the [imóvel]?
The only way to know for sure is to try, but please do. I’d love to get more feedback on this.
If that doesn’t work, I’ll need to dive a bit deeper into the codebase myself because I may need to start a GitHub issue for this.
Ok. I’ll give it a try! Thanks Vincent!
Hi Vincent!
I removed all the non-ASCII characters from the entity names (“ç”, “Ã”, “ó”, …), but this error keeps appearing:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-10: ordinal not in range(128)
I really can’t figure out why.
Can you help me?
Thanks!
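For context on the error above: `UnicodeEncodeError` with the `'ascii'` codec is raised whenever Python is asked to encode text containing characters outside the ASCII range using ASCII. The sketch below only reproduces that class of error with a word from this thread. Where the ASCII encoding happens in the reported setup is not known from the thread, and `ascii_failure` is a hypothetical helper name.

```python
def ascii_failure(text):
    """Return the (start, end) character range that ASCII cannot encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return (err.start, err.end)
    return None


print(ascii_failure("simulação"))  # (6, 8): the characters 'ç' and 'ã'
print(ascii_failure("simulacao"))  # None

# Encoding the same text as UTF-8 works without errors.
print(len("simulação".encode("utf-8")))  # 11 bytes for 9 characters
```

Since the error persisted after the entity names were cleaned, the offending characters may be elsewhere, for example in the example sentences themselves.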
This may sound strange, but could you send me the smallest file that’s causing the error? This may be one of those moments where the operating system accidentally added a dangerous character to the file. Could you send it as a file attachment instead of a code snippet? Sometimes the Discourse forum fixes some of the characters before storing it.
Anecdote: I knew a team that spent two weeks looking for the reason their pipeline broke. The culprit turned out to be a .tsv file that wasn’t separated by tabs but by the Icelandic thorn. I’m wondering if something similar may be happening here.
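One way to hunt for a stray character like the one in the anecdote is to list every non-ASCII character in a file together with its position and Unicode name. This is a minimal sketch; `find_non_ascii` is a hypothetical helper, and the commented file name is just the one mentioned in this thread.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_no, col_no, char, unicode_name) for every non-ASCII character."""
    hits = []
    # Split on "\n" only, so unusual line separators are reported, not hidden.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col_no, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


sample = "a\tb\ncþd\n"
print(find_non_ascii(sample))  # [(2, 2, 'þ', 'LATIN SMALL LETTER THORN')]

# Usage on a real file (path is an assumption):
# with open("nlu.yml", encoding="utf-8") as f:
#     for hit in find_non_ascii(f.read()):
#         print(hit)
```

In Portuguese data most hits will be legitimate accented letters, so the useful signal is in the unexpected names, such as control characters or zero-width characters.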
Hi Vincent!
You mean the nlu.yml, right?
Give me a couple of minutes, please.
Thanks
Here you have Vincent. Thanks!
I have the file locally. Just to check though: is this file allowed to be public? I’m asking because (1) this forum is public, and (2) I’d like to share this dataset with our research team, if you don’t mind.
Hi Vincent.
The file contains no sensitive or confidential data. You can share it with your team.
Thanks!
Could you also share the config.yml file that’s associated with the error?
Also, are you getting these warnings during training too? I’m seeing these with the default DIET pipeline.
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Para rendimentos obtidos no estrangeiro que me foram pagos com moeda diferente do € como devo declarar no anexo J?' with intent 'declarar'. Make sure the start and end values of entities ([(5, 39, 'rendimentos estrangeiro'), (69, 78, 'diferente'), (82, 83, '€'), (106, 113, 'anexo J')]) in the training data match the token boundaries ([(0, 4, 'Para'), (5, 16, 'rendimentos'), (17, 24, 'obtidos'), (25, 27, 'no'), (28, 39, 'estrangeiro'), (40, 43, 'que'), (44, 46, 'me'), (47, 52, 'foram'), (53, 58, 'pagos'), (59, 62, 'com'), (63, 68, 'moeda'), (69, 78, 'diferente'), (79, 81, 'do'), (84, 88, 'como'), (89, 93, 'devo'), (94, 102, 'declarar'), (103, 105, 'no'), (106, 111, 'anexo'), (112, 113, 'J')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'recebi rendimentos do estrangeiro pagos em moeda diferente Euro €. como faço a conversão?' with intent 'declarar'. Make sure the start and end values of entities ([(7, 33, 'rendimentos estrangeiro'), (43, 65, 'moeda diferente Euro €'), (79, 88, 'conversão')]) in the training data match the token boundaries ([(0, 6, 'recebi'), (7, 18, 'rendimentos'), (19, 21, 'do'), (22, 33, 'estrangeiro'), (34, 39, 'pagos'), (40, 42, 'em'), (43, 48, 'moeda'), (49, 58, 'diferente'), (59, 63, 'Euro'), (67, 71, 'como'), (72, 76, 'faço'), (77, 78, 'a'), (79, 88, 'conversão')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Tenho faturas da farmácia com taxa de IVA a 23% entram para o IRS?' with intent 'declarar'. Make sure the start and end values of entities ([(6, 25, 'despesas de saúde'), (38, 47, 'IVA a 23%'), (62, 65, 'IRS')]) in the training data match the token boundaries ([(0, 5, 'Tenho'), (6, 13, 'faturas'), (14, 16, 'da'), (17, 25, 'farmácia'), (26, 29, 'com'), (30, 34, 'taxa'), (35, 37, 'de'), (38, 41, 'IVA'), (42, 43, 'a'), (44, 46, '23'), (48, 54, 'entram'), (55, 59, 'para'), (60, 61, 'o'), (62, 65, 'IRS')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Se arrendar uma casa e uma garagem tenho de fazer duas comunicações diferentes?' with intent 'arrendamento'. Make sure the start and end values of entities ([(3, 20, 'arrendamento urbano'), (68, 77, 'diferente')]) in the training data match the token boundaries ([(0, 2, 'Se'), (3, 11, 'arrendar'), (12, 15, 'uma'), (16, 20, 'casa'), (21, 22, 'e'), (23, 26, 'uma'), (27, 34, 'garagem'), (35, 40, 'tenho'), (41, 43, 'de'), (44, 49, 'fazer'), (50, 54, 'duas'), (55, 67, 'comunicações'), (68, 78, 'diferentes')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Se o imóvel pertencer a vários vendedores quantas (declaracoes) tenho que fazer?' with intent 'IMI'. Make sure the start and end values of entities ([(5, 11, 'imóvel'), (24, 41, 'compropriedade'), (50, 63, '(declaracoes)')]) in the training data match the token boundaries ([(0, 2, 'Se'), (3, 4, 'o'), (5, 11, 'imóvel'), (12, 21, 'pertencer'), (22, 23, 'a'), (24, 30, 'vários'), (31, 41, 'vendedores'), (42, 49, 'quantas'), (51, 62, 'declaracoes'), (64, 69, 'tenho'), (70, 73, 'que'), (74, 79, 'fazer')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
/home/vincent/Development/rasa-non-ascii/venv/lib/python3.7/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message 'Preciso obter o meu registo de contribuinte mas não sei como procedervia [online](portal). É para fins de abertura de conta bancária' with intent 'NIF'. Make sure the start and end values of entities ([(20, 43, 'registo de contribuinte'), (48, 55, 'não sei'), (69, 80, 'via [online')]) in the training data match the token boundaries ([(0, 7, 'Preciso'), (8, 13, 'obter'), (14, 15, 'o'), (16, 19, 'meu'), (20, 27, 'registo'), (28, 30, 'de'), (31, 43, 'contribuinte'), (44, 47, 'mas'), (48, 51, 'não'), (52, 55, 'sei'), (56, 60, 'como'), (61, 72, 'procedervia'), (74, 88, 'online](portal'), (91, 92, 'É'), (93, 97, 'para'), (98, 102, 'fins'), (103, 105, 'de'), (106, 114, 'abertura'), (115, 117, 'de'), (118, 123, 'conta'), (124, 132, 'bancária')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
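The warnings above can be checked by hand. The sketch below uses the entity spans and token boundaries copied from the first warning and assumes that "aligned" means an entity starts where some token starts and ends where some token ends. That assumption is inferred from the warning text, not taken from the Rasa source, and `misaligned_entities` is a hypothetical helper.

```python
def misaligned_entities(entities, tokens):
    """Return entities whose start or end does not coincide with a token boundary."""
    token_starts = {start for start, _end, _text in tokens}
    token_ends = {end for _start, end, _text in tokens}
    return [
        entity
        for entity in entities
        if entity[0] not in token_starts or entity[1] not in token_ends
    ]


# Values copied from the first warning above.
entities = [
    (5, 39, "rendimentos estrangeiro"), (69, 78, "diferente"),
    (82, 83, "€"), (106, 113, "anexo J"),
]
tokens = [
    (0, 4, "Para"), (5, 16, "rendimentos"), (17, 24, "obtidos"), (25, 27, "no"),
    (28, 39, "estrangeiro"), (40, 43, "que"), (44, 46, "me"), (47, 52, "foram"),
    (53, 58, "pagos"), (59, 62, "com"), (63, 68, "moeda"), (69, 78, "diferente"),
    (79, 81, "do"), (84, 88, "como"), (89, 93, "devo"), (94, 102, "declarar"),
    (103, 105, "no"), (106, 111, "anexo"), (112, 113, "J"),
]

print(misaligned_entities(entities, tokens))  # [(82, 83, '€')]
```

In this message the token list jumps from (79, 81, 'do') to (84, 88, 'como'), so no token was produced for the € sign at position 82 and the entity annotated on it has nothing to align with.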
Hi Vincent,
Yes, I always get those errors with DIET.
Here is my config file:
language: pt
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: CRFEntityExtractor
  - name: DIETClassifier
    epochs: 30
    learning_rate: 0.005
    constrain_similarities: true
    entity_recognition: False
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.5
    ambiguity_threshold: 0.1
policies:
  - name: MemoizationPolicy
  - name: RulePolicy
  - name: UnexpecTEDIntentPolicy
    max_history: 5
    epochs: 100
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    constrain_similarities: true
Hi, I’ve been trying to train a Rasa bot in the Sinhala language. The text contains whitespace-like Unicode characters such as the zero-width joiner. I have a couple of entities I want to recognize. In the NLU file I have:
- intent: most_popular_song_of_artist
examples: |
- [āˇāˇāˇāˇāļąāˇ āļ¸āˇāˇāˇāļģāļāļ](artist)āļ⎠āļĸāļąāļ´āˇâāļģāˇāļēāļ¸ āˇāˇāļąāˇāļ¯āˇāˇ āļ¸āˇāļāļāˇāļ¯?
- [āļ
āļāˇāļŊ āļ
āļ°āˇāļāˇāļģāˇ](artist)āļ⎠āļĸāļąāļ´āˇâāļģāˇāļēāļ¸ āļāˇāļāļē āļ¸āˇāļāļāˇāļ¯?
- [āˇāļąāˇāļ⎠āļŠāļģāˇāļ§āˇ](artist)āļ⎠āļĸāļąāļ´āˇâāļģāˇāļēāļ¸ āˇāˇāļāļ¯āˇāˇ āļ¸āˇāļāļ¯āˇāļ¯ āļāˇāļ¸āļāˇāļ¯ āļāˇāļēāļŊ⎠āļāˇāļēāļąāˇāļ¯?
- [āļąāļąāˇāļ¯āˇ āļ¸āˇāļŊāļąāˇ](artist)āļ⎠āļ´āˇâāļģāˇāˇāļ¯āˇāļ°āļ¸ āļāˇāļāļē āļāˇāļ¸āļāˇāļ¯?
- lookup: artist
examples: |
- āˇāˇāˇāˇāļąāˇ āļ¸āˇāˇāˇāļģāļāļ
- āļ¸āˇāļŊāˇāļ§āļąāˇ āļ¸āļŊāˇāļŊāˇāˇāļģāļ āˇāļ āˇ
- āˇāˇāļģāˇāļąāˇāļ¯āˇâāļģ āļ´āˇāļģāˇāļģāˇ
- āļ
āļāˇāļŊ āļ
āļ°āˇāļāˇāļģāˇ
While training, Rasa warns me:
UserWarning: Misaligned entity annotation in message 'āˇāˇāˇāˇāļąāˇ āļ¸āˇāˇāˇāļģāļāļāļ⎠āļĸāļąāļ´āˇâāļģāˇāļēāļ¸ āˇāˇāļąāˇāļ¯āˇāˇ āļ¸āˇāļāļāˇāļ¯?' with intent 'most_popular_song_of_artist'. Make sure the start and end values of entities ([(0, 14, 'āˇāˇāˇāˇāļąāˇ āļ¸āˇāˇāˇāļģāļāļ')]) in the training data match the token boundaries ([(0, 6, 'āˇāˇāˇāˇāļąāˇ'), (7, 16, 'āļ¸āˇāˇāˇāļģāļāļāļāˇ'), (17, 26, 'āļĸāļąāļ´āˇ\u200dāļģāˇāļēāļ¸'), (27, 34, 'āˇāˇāļąāˇāļ¯āˇāˇ'), (35, 41, 'āļ¸āˇāļāļāˇāļ¯')]). Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
I tried changing the above to the following (added start and end):
- intent: most_popular_song_of_artist
examples: |
- [āˇāˇāˇāˇāļąāˇ āļ¸āˇāˇāˇāļģāļāļ]{"entity":"artist","start":0,"end":11}āļ⎠āļĸāļąāļ´āˇâāļģāˇāļēāļ¸ āˇāˇāļąāˇāļ¯āˇāˇ āļ¸āˇāļāļāˇāļ¯?
- [āļ
āļāˇāļŊ āļ
āļ°āˇāļāˇāļģāˇ]{"entity":"artist","start":0,"end":11}āļ⎠āļĸāļąāļ´āˇâāļģāˇāļēāļ¸ āļāˇāļāļē āļ¸āˇāļāļāˇāļ¯?
But the DIET classifier still returns zero entities. I think this happens due to issues with whitespace. How can I solve this issue?
Why don’t you have a whitespace between ) and the next letter?
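The question about the missing whitespace can be illustrated with a small sketch. It uses a made-up Latin-script sentence as a stand-in for the Sinhala examples, and a plain whitespace split as a rough approximation of a whitespace tokenizer. Rasa's WhitespaceTokenizer also handles punctuation, so this is not its exact behaviour, and both helper names are hypothetical.

```python
import re


def whitespace_spans(text):
    """Split on Unicode whitespace only; return (start, end, token) triples."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def ends_on_token_boundary(entity_end, spans):
    """True if some token ends exactly where the entity ends."""
    return any(end == entity_end for _start, end, _token in spans)


# Stand-in data: the entity "John Smith" covers characters 0-10.
glued = whitespace_spans("John Smiths top song?")    # suffix glued to the entity
spaced = whitespace_spans("John Smith s top song?")  # suffix separated by a space

print(ends_on_token_boundary(10, glued))   # False
print(ends_on_token_boundary(10, spaced))  # True

# U+200D (zero-width joiner) is not whitespace, so it never splits a token.
print("\u200d".isspace())                  # False
print(len(whitespace_spans("ab\u200dcd")))  # 1
```

This matches the warning in the post above, where the entity ends at 14 but the second token spans (7, 16) because the suffix after the closing parenthesis is part of the same token.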