Uppercase Cyrillic character

Hi all, I’m still new to Rasa. I noticed a strange error when a training data file (nlu.yml, nlu_*.yml) contains the uppercase Cyrillic character ‘И’: the commands rasa shell, rasa train, and rasa data validate all fail. From version 2.1 to version 2.4 there were no problems.

The last lines of the debug log:

File "c:\users\rakis\anaconda3\envs\rasa_env\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 4089: character maps to <undefined>

Using the Russian alphabet as an example, this fails:

  - intent: inform
    examples: |
      - АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

and this works, with a lowercase ‘и’ in place of the uppercase ‘И’:

  - intent: inform
    examples: |
      - АБВГДЕЁЖЗиЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

rasa --version
Rasa Version     : 2.4.0
Rasa SDK Version : 2.4.0
Rasa X Version   : None
Python Version   : 3.8.8 (also tried version 3.7.6)
Operating System : Windows-10-10.0.19041-SP0
Python Path      : c:\users\rakis\anaconda3\envs\rasa_env\python.exe

The rest of the project files are the unmodified originals created by the rasa init command.

P.S. I’m really trying to improve my English :slight_smile:

You’ll want to save your training data files in UTF-8 encoding. The error indicates the file is being read as cp1251. If you’re using an IDE like VSCode or a text editor like Notepad++, you can make UTF-8 the default (which I would recommend).
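As a hedged illustration of why only the uppercase letter trips the error: the sketch below encodes both letters as UTF-8 and then decodes the bytes with cp1251, which is the codec named in the traceback. The helper name `decode_as` is made up for this example and is not part of Rasa.

```python
# Sketch: what happens when UTF-8 bytes are decoded with the cp1251 codec.
# decode_as is a hypothetical helper for illustration only.

def decode_as(text: str, codec: str) -> str:
    """Encode text as UTF-8, then decode those bytes with another codec."""
    return text.encode("utf-8").decode(codec)


upper = "\u0418"  # uppercase Cyrillic И
lower = "\u0438"  # lowercase Cyrillic и

print(upper.encode("utf-8").hex())  # d098
print(lower.encode("utf-8").hex())  # d0b8

# Both bytes of the lowercase letter are assigned in cp1251, so decoding
# succeeds silently, but yields the wrong text (two unrelated letters).
print(ascii(decode_as(lower, "cp1251")))  # '\u0420\u0451'

# Byte 0x98 is unassigned in cp1251, so the uppercase letter raises.
try:
    decode_as(upper, "cp1251")
except UnicodeDecodeError as err:
    print(err)
```

So the lowercase version does not really "work": the file loads, but the text is presumably mis-decoded if the same codec is applied to it.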

I have the same problem @mloubser:

In my case, the error says CP1252, but I checked all of my files, they’re all encoded UTF-8 according to VSCode.

Hi all. I’ve just tried to reproduce this issue and I can confirm that on my machine it seems to work fine. I’ve started a rasa init project with this example in the nlu.yml:

АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

The rasa train command works fine on my machine and I have tried on Rasa 2.3.4 and 2.4.1. This makes me think that what you’re experiencing might be related to Windows.

Are you 100% sure that it’s not a UTF-8 related issue?

Thanks for the response.

This is weird, VSCode says all my files are UTF-8. I even saved them again as UTF-8 just to be sure. Is there a better way of checking than looking at VSCode? [image attached]

There’s a settings suggestion here but I believe the utf-8 mention at the bottom should confirm it.

Maybe restart VSCode? I know it sounds silly, but in my experience this helps with VSCode settings.

Yeah, both the default encoding and the footer show UTF-8.

I’ll try rasa init and put a few examples.

@koaning I did rasa init, and I changed the three attached files (just added an intent that leads to utter_greet).

nlu.yml (3.7 KB) rules.yml (331 Bytes) domain.yml (586 Bytes)

Notepad++ and VSCode both show UTF-8.

After rasa train, I still get the error.

Can you confirm whether you can open the files via:

file = open(filename, encoding="utf8")

I can, and file.readlines() gives the correct output
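The check above passes an explicit encoding, so it cannot show what happens when a file is opened without one. A minimal sketch for comparing the two cases follows; the function names and the temporary sample file are assumptions made for illustration, and nothing here is taken from the Rasa code base.

```python
import locale
import sys
import tempfile
from pathlib import Path


def default_text_encoding() -> str:
    """Return the encoding open() falls back to when none is given."""
    return locale.getpreferredencoding(False)


def can_read(path: Path, encoding=None) -> bool:
    """Try to read the whole file; encoding=None means the platform default."""
    try:
        path.read_text(encoding=encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print("default encoding for open():", default_text_encoding())
    print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))

    # Hypothetical sample standing in for data/nlu.yml, saved as UTF-8.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "nlu.yml"
        content = "- intent: inform\n  examples: |\n    - \u0418\n"
        sample.write_bytes(content.encode("utf-8"))
        print("explicit utf-8  :", can_read(sample, "utf-8"))
        print("explicit cp1251 :", can_read(sample, "cp1251"))
        print("platform default:", can_read(sample))
```

If the default turns out to be cp1251 or cp1252 while the explicit UTF-8 read succeeds, that would be consistent with some read happening without an encoding argument, though this thread does not establish that this is the cause. Python's UTF-8 mode (`-X utf8` or `PYTHONUTF8=1`, available from Python 3.7) changes that default.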

@millerpro @ChrisRahme I might want to check where in our code this error originates. Could you share the full traceback?

Sure:

Skipping registering GPU devices...
Traceback (most recent call last):
  File "E:\Program Files\Python\Python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\Program Files\Python\Python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\...\_venv\lib\site-packages\rasa\__main__.py", line 134, in <module>
    main()
  File "E:\...\_venv\lib\site-packages\rasa\__main__.py", line 116, in main
    cmdline_arguments.func(cmdline_arguments)
  File "E:\...\_venv\lib\site-packages\rasa\cli\train.py", line 58, in <lambda>
    train_parser.set_defaults(func=lambda args: train(args, can_exit=True))
  File "E:\...\_venv\lib\site-packages\rasa\cli\train.py", line 90, in train
    training_result = rasa.train(
  File "E:\...\_venv\lib\site-packages\rasa\train.py", line 94, in train
    return rasa.utils.common.run_in_loop(
  File "E:\...\_venv\lib\site-packages\rasa\utils\common.py", line 307, in run_in_loop
    result = loop.run_until_complete(f)
  File "E:\Program Files\Python\Python38\lib\asyncio\base_events.py", line 608, in run_until_complete
    return future.result()
  File "E:\...\_venv\lib\site-packages\rasa\train.py", line 151, in train_async
    file_importer = TrainingDataImporter.load_from_config(
  File "E:\...\_venv\lib\site-packages\rasa\shared\importers\importer.py", line 85, in load_from_config
    return TrainingDataImporter.load_from_dict(
  File "E:\...\_venv\lib\site-packages\rasa\shared\importers\importer.py", line 150, in load_from_dict
    RasaFileImporter(
  File "E:\...\_venv\lib\site-packages\rasa\shared\importers\rasa.py", line 33, in __init__
    self._story_files = rasa.shared.data.get_data_files(
  File "E:\...\_venv\lib\site-packages\rasa\shared\data.py", line 152, in get_data_files
    new_data_files = _find_data_files_in_directory(path, filter_predicate)
  File "E:\...\_venv\lib\site-packages\rasa\shared\data.py", line 172, in _find_data_files_in_directory
    if filter_property(full_path):
  File "E:\...\_venv\lib\site-packages\rasa\shared\data.py", line 220, in is_story_file
    return YAMLStoryReader.is_stories_file(
  File "E:\...\_venv\lib\site-packages\rasa\shared\core\training_data\story_reader\yaml_story_reader.py", line 163, in is_stories_file
    return rasa.shared.data.is_likely_yaml_file(file_path) and cls.is_key_in_yaml(
  File "E:\...\_venv\lib\site-packages\rasa\shared\core\training_data\story_reader\yaml_story_reader.py", line 183, in is_key_in_yaml
    return any(
  File "E:\...\_venv\lib\site-packages\rasa\shared\core\training_data\story_reader\yaml_story_reader.py", line 183, in <genexpr>
    return any(
  File "E:\Program Files\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 662: character maps to <undefined>
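The last frames of the traceback are inside Python's cp1252 codec, reached from is_key_in_yaml while the data files are being scanned. To narrow down which file contains the offending byte, something like the sketch below might help. It assumes the training data sits in the data folder created by rasa init; the function name `find_decode_failures` is invented for this example.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_decode_failures(root: Path, codec: str) -> Iterator[Tuple[Path, int, int]]:
    """Yield (path, byte_value, position) for each .yml file under root
    whose raw bytes cannot be decoded with the given codec."""
    for path in sorted(root.rglob("*.yml")):
        raw = path.read_bytes()
        try:
            raw.decode(codec)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte.
            yield path, raw[err.start], err.start


if __name__ == "__main__":
    # Assumption: "data" is the training data folder; cp1252 is the codec
    # named in the traceback above.
    for path, byte, pos in find_decode_failures(Path("data"), "cp1252"):
        print(f"{path}: byte 0x{byte:02x} at position {pos}")
```

For small files the reported position should line up with the one in the error message (byte 0x81 at position 662 here), which would identify the file and the character involved. Running the same scan with "utf-8" as the codec would also confirm whether every file really is valid UTF-8 on disk.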