Doubts about using 'rasa data convert' on the CLI with a foreign language and special characters

Hey all,

I am having trouble using the data converter available in the Rasa CLI because of some characters and symbols frequently used in Brazilian Portuguese. I already have a corpus in JSON that works well on previous Rasa versions. As I am starting to adapt all my routines to be compatible with newer versions (>2.0.0), I tried the following CLI command:

rasa data convert nlu -f md --data=old_data --out=data

(At first the conversion is to ‘md’ because it seems like I need an ‘md’ file to convert to yaml.)

The conversion runs without errors and a new file is generated. It seems well parsed, except for the special characters, probably because of an encoding change somewhere in the middle of the process. In the output file, some special characters are replaced by escape strings such as ‘\u00e1’.

I tried the variations below with no success:

rasa data convert nlu -f yaml --data=old_data --out=data -l pt

rasa data convert nlu -f yaml --data=old_data --out=data -l pt-br

Does anyone know a fix for this? Has anyone gone through this? I want to know if there is an easy way to deal with it before developing my own parser.
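For reference, here is a minimal sketch of a post-processing workaround, not a confirmed fix. It assumes the escapes in the output are literal backslash-u sequences like the ones `json.dumps` produces by default; whether the Rasa converter writes them for that reason is an assumption. The helper names `unescape_unicode` and `fix_file` are hypothetical.

```python
import json
import re

# By default json.dumps escapes non-ASCII characters, which yields the kind
# of output described above; ensure_ascii=False keeps them as-is.
escaped = json.dumps("olá")                       # contains a literal \u00e1
readable = json.dumps("olá", ensure_ascii=False)  # contains á

# Matches a literal backslash followed by "u" and four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text: str) -> str:
    """Replace literal backslash-u escapes with the characters they encode.

    Limitation: escapes that come in surrogate pairs (e.g. emoji) are not
    recombined here; accented Portuguese letters are single escapes.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def fix_file(src: str, dst: str) -> None:
    """Read a converted file and write a copy with the escapes decoded."""
    with open(src, encoding="utf-8") as f:
        content = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(unescape_unicode(content))
```

Something like `fix_file("data/nlu.md", "data/nlu_fixed.md")` would then be run on the converter output; both paths are placeholders.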

What’s up @bayesianwannabe :smiley:. Well, I also have a project in Brazilian Portuguese and I migrated to version 2.x easily. You can check the migration guide from 1.x to 2.x, and also the blog post that used an older version, right here. And of course, Rasa Livecoding. Hope it helps you!

Hey Marcos! Thank you for the answer. I also used the official guides to migrate my project, but I wonder if this is specific to the JSON part of the parser. Mind if I ask what format your files were in before the migration?

Maybe ‘md’ to ‘yaml’ doesn’t mess with encoding (as a matter of fact, my stories were nicely parsed using the Rasa CLI with no encoding issues), and only the ‘json’ to ‘md’ functionality has this issue?

I was using version 1.x, so my files were in .md format, and I converted them without issues just by doing what I told you.

I see! Thank you for your answer. Did you run rasa data convert on Linux? I am on W10 and I’m starting to think that the OS might be related.
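One way to check the OS hypothesis is to compare what Python reports as its default encodings on each machine. This is only a diagnostic sketch: the function name `encoding_report` is hypothetical, and nothing here confirms that the default encoding is what causes the escaped output.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python uses by default on this machine."""
    return {
        # Used by open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
        # Used when printing to the console; may be None if redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when Python runs in UTF-8 mode (Python 3.7+).
        "utf8_mode": bool(getattr(sys.flags, "utf8_mode", 0)),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

On Windows the preferred encoding is often a legacy code page rather than UTF-8. On Python 3.7 or newer, setting the environment variable `PYTHONUTF8=1` before running the command switches Python to UTF-8 mode; whether that changes the converter output is untested here.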

:dancer: Yeah, I do use Linux, and I didn’t even think to ask which OS you were using! I have no idea whether it is related, but I’d suggest googling it; I’m pretty sure you’ll find something, @bayesianwannabe. Sorry I can’t help you with W10…

No problem at all! Sorry for so many questions. I wonder if you are able to use Rasa X normally on Linux with Brazilian Portuguese sentences containing special characters, since I am also having an encoding problem there, as I posted here:

Thank you!

No worries! Let’s talk and help each other. And yes, the project is running normally for me. I’ll take a look there.