Data format of Rasa for Arabic

Hi there,

I’m using Rasa 1.10.24. I have to build an Arabic version of a chatbot we already have in English. My pipeline uses DIETClassifier, WhitespaceTokenizer, CountVectorsFeaturizer, and language ar. Because I can’t read Arabic and it’s written differently, I don’t know how entity annotations work. I looked at the NLU files in the Rasa Arabic PoCs; I expected all of the annotations to look like entity value, but that doesn’t seem to be the case. We are translating the NLU data directly with Google’s translation API. I get warnings like this one when I train NLU:

rasa/utils/common.py:387: UserWarning: Misaligned entity annotation in message ‘ما هي العلامات المبكرة لخلل التنسج الصدري’ with intent ‘user_inform_health’. Make sure the start and end values of entities in the training data match the token boundaries (e.g. entities don’t include trailing whitespaces or punctuation).

Can someone explain how entity annotations should be written?

That’s strange.

Is there a snippet of the NLU file that you can pass along to me so I might be able to reproduce the error?

Sure, I’ll ping you on Slack.

If anyone’s looking at this thread: it turns out that working with RTL languages in code editors like VSCode is quite problematic. The solution is to have someone on the team who knows Arabic annotate your entities with this tool (credit goes to the Rasa Arabic user group): Rasa Arabic Annotation Helper.
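
For anyone who wants to catch these warnings before training, here is a minimal sketch of an alignment check. It assumes the Rasa 1.x Markdown syntax `[entity text](entity_name)` and approximates tokenization by splitting on whitespace only, so it will not match WhitespaceTokenizer exactly (punctuation handling differs). The function names `parse_example`, `token_boundaries`, and `misaligned` are made up for illustration and are not part of Rasa.

```python
import re

# Assumed Rasa 1.x Markdown annotation syntax: [entity text](entity_name)
ANNOTATION = re.compile(r"\[(?P<text>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_example(line):
    """Strip the markup and return (plain_text, entity_spans)."""
    plain = ""
    entities = []
    last = 0
    for m in ANNOTATION.finditer(line):
        plain += line[last:m.start()]      # text before the annotation
        start = len(plain)
        plain += m.group("text")           # the annotated text itself
        entities.append({
            "start": start,
            "end": len(plain),
            "value": m.group("text"),
            "entity": m.group("entity"),
        })
        last = m.end()
    plain += line[last:]                   # text after the last annotation
    return plain, entities


def token_boundaries(text):
    """Start and end offsets of whitespace-separated tokens (an approximation)."""
    starts, ends = set(), set()
    for m in re.finditer(r"\S+", text):
        starts.add(m.start())
        ends.add(m.end())
    return starts, ends


def misaligned(line):
    """Return the entities whose span does not begin and end on a token boundary."""
    plain, entities = parse_example(line)
    starts, ends = token_boundaries(plain)
    return [e for e in entities
            if e["start"] not in starts or e["end"] not in ends]


if __name__ == "__main__":
    examples = [
        "signs of [thoracic dysplasia](disease)",    # aligned
        "signs of [thoracic dysplasia ](disease)",   # trailing space inside the brackets
    ]
    for ex in examples:
        print(ex, "->", misaligned(ex))
```

The same check should work on Arabic lines, because Python strings are indexed by code point in logical (typing) order regardless of how an editor displays them. An annotation that starts or ends inside a whitespace-delimited word, for example after an attached prefix, would be reported as misaligned.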
