Doubts on using 'rasa data convert' on cli with foreign language and special characters

bayesianwannabe · December 10, 2020, 6:25pm

Hey all,

I am having trouble using the data converter available in the Rasa CLI because of some characters and symbols frequently used in Brazilian Portuguese. I already have a corpus in JSON that works well on previous Rasa versions. As I am starting to adapt all my routines to be compatible with newer versions (>2.0.0), I tried using the following CLI command:

rasa data convert nlu -f md --data=old_data --out=data

(At first the conversion is to ‘md’ because it seems like I need an ‘md’ file to convert to yaml.)

Things go OK: a new file is generated and seems well parsed, except for the special characters, probably due to encoding changes in the middle of the process. The output file is filled with escape strings such as ‘\u00e1’ in place of some special characters.

I tried the variations below with no success:

rasa data convert nlu -f yaml --data=old_data --out=data -l pt

rasa data convert nlu -f yaml --data=old_data --out=data -l pt-br

Does anyone know a fix for this? Has anyone gone through this? I want to know if there is an easy way to deal with this before developing my own parser.
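As a stopgap short of writing a full parser, the literal ‘\u00e1’-style sequences could be decoded in a post-processing pass over the converted file. The following is only a minimal sketch: the function names `unescape_unicode` and `fix_file` are hypothetical, it assumes the converted file is UTF-8 text, and it is not part of Rasa.

```python
import re

# Matches a literal backslash, the letter "u", and four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text: str) -> str:
    """Replace literal backslash-u escapes with the characters they encode.

    Covers characters in the Basic Multilingual Plane (for example the
    accented letters used in Portuguese). Surrogate pairs, as used for
    emoji, are not recombined by this sketch.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def fix_file(src: str, dst: str) -> None:
    """Read src as UTF-8, decode the escapes, and write dst as UTF-8."""
    with open(src, encoding="utf-8") as f:
        content = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(unescape_unicode(content))
```

Calling `fix_file` on the converted file and a new output path would leave the original untouched, which makes it easy to diff the two before retraining.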

marcos.allysson · December 11, 2020, 10:22pm

What’s up, @bayesianwannabe. I also have a project in Brazilian Portuguese, and I migrated to version 2.x easily. You can check the migration guide from 1.x to 2.x, and also the blog post that was using an older version, right here. And of course, Rasa Livecoding. Hope it helps you!

bayesianwannabe · December 14, 2020, 12:03pm

Hey Marcos! Thank you for the answer. I also used the official guides for migration of my project, but I wonder if it’s a particular aspect of the json part of the parser. Mind if I ask what format were your files before the migration?

Maybe ‘md’ to ‘yaml’ doesn’t mess with encoding (as a matter of fact, my stories were nicely parsed using the Rasa CLI with no encoding issues), and only the ‘json’ to ‘md’ functionality has this issue?
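For context on why a JSON step might be the culprit: Python's standard `json` module escapes non-ASCII characters by default when serializing, which produces exactly this kind of ‘\u00e1’ output. Whether Rasa's converter goes through this path is an assumption here, not something confirmed in this thread; the sketch only shows the standard-library behavior.

```python
import json

# By default, json.dumps escapes every non-ASCII character.
escaped = json.dumps("olá")

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps("olá", ensure_ascii=False)

# Parsing the escaped form restores the original text, so the escapes
# are harmless in JSON but show up literally if copied into Markdown.
restored = json.loads(escaped)
```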

marcos.allysson · December 15, 2020, 5:58pm

I was using version 1.x, so my files were in .md format, but I converted them without issues just by doing what I told you.

bayesianwannabe · December 15, 2020, 6:12pm

I see! Thank you for your answer. Did you run rasa data convert on Linux? I am on Windows 10 and I’m starting to think that the OS might be related.
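One way to probe the OS hypothesis is to check which text encoding Python picks by default on each machine. On Windows this is often a legacy code page rather than UTF-8, while on Linux it is usually UTF-8. The sketch below uses only the standard library; the helper names are hypothetical, and it does not show what Rasa itself does when reading or writing files.

```python
import locale
import tempfile
from pathlib import Path


def default_text_encoding() -> str:
    """Return the encoding open() uses when none is given explicitly."""
    return locale.getpreferredencoding(False)


def roundtrip_utf8(text: str) -> str:
    """Write and read back text with an explicit UTF-8 encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.md"
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    print("default encoding:", default_text_encoding())
    print("round trip:", roundtrip_utf8("ação"))
```

If the default differs between the two machines, that would be consistent with the OS playing a role, although it would not prove it.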

marcos.allysson · December 15, 2020, 6:22pm

Yeah, I do use Linux, and it didn’t even occur to me to ask which OS you might be using! Whether it is related, I have no idea, but I’d suggest googling this; I’m pretty sure you’ll find something, @bayesianwannabe. Sorry I can’t help you with Windows 10…

bayesianwannabe · December 15, 2020, 6:53pm

No problem at all! I am sorry for so many questions. I wonder if you are able to use Rasa X normally on Linux with Brazilian Portuguese sentences containing special characters, as I am also having an encoding problem there, which I posted here:

Thank you!

marcos.allysson · December 15, 2020, 6:55pm

No worries! Let’s talk and help each other. And yes, the project is running normally for me. I’ll take a look there.
