I'm trying to develop a chatbot in Urdu using rasa_nlu. My model classifies intents correctly but fails to extract entities. I'm using Python 3.7.2 with rasa_nlu 0.14.6 on a Windows 10 machine.
I've made sure that the training data is in the correct format, i.e. the start and end positions of the entities are correct (a quick check is included after the example below), but I can't figure out what the issue is.
The following are the contents of my config.yml file:
language: "ur"

pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
For reference, the following is an example of what my training data looks like:
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "پوچھنا",
        "entities": [
          {
            "end": 16,
            "entity": "نام",
            "start": 12,
            "value": "عامر"
          }
        ],
        "text": "اسلام علیکم عامر"
      }
    ]
  }
}
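A quick way to sanity-check the offsets is a small script along these lines (nlu_data.json is just a placeholder for wherever the training file lives), which confirms that text[start:end] really equals each entity value:

import json

# Load the Rasa NLU training file (path is a placeholder).
with open("nlu_data.json", encoding="utf-8") as f:
    data = json.load(f)

# For every example, check that the slice text[start:end] matches the entity value.
for example in data["rasa_nlu_data"]["common_examples"]:
    text = example["text"]
    for entity in example.get("entities", []):
        span = text[entity["start"]:entity["end"]]
        if span != entity["value"]:
            print(f"Mismatch in {text!r}: got {span!r}, expected {entity['value']!r}")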
Hi @hijab10, have you tried running on python 3.6? Rasa NLU doesn’t yet support python 3.7 because there’s not yet a rasa-supported version of tensorflow for it. I’m actually surprised you’re not getting import errors if you’re using the tensorflow intent classifier in your pipeline. What version of tensorflow are you running?
Hey @erohmensing, thank you for responding. I'm using tensorflow 1.13.1, and I was actually able to resolve my issue by changing the entity name to English while keeping the value in Urdu. I don't know why it works this way, but I'm not complaining. And no, I didn't get any import errors. However, when I ran the same configuration in a Docker instance, I was getting the "Unicode error", I guess because Urdu is UTF-8 encoded.
This is what my training data looks like now:
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "greet",
        "entities": [
          {
            "end": 16,
            "entity": "name",
            "start": 12,
            "value": "عامر"
          }
        ],
        "text": "اسلام علیکم عامر"
      }
    ]
  }
}
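To confirm the entity now actually gets extracted, a minimal train-and-parse round trip with the 0.14-style rasa_nlu API looks roughly like this (the file paths are placeholders):

from rasa_nlu import config
from rasa_nlu.model import Interpreter, Trainer
from rasa_nlu.training_data import load_data

# Train a model from the training data and pipeline config shown above.
training_data = load_data("nlu_data.json")
trainer = Trainer(config.load("config.yml"))
trainer.train(training_data)
model_directory = trainer.persist("./models")

# Parse an Urdu message and check that the "name" entity shows up in the output.
interpreter = Interpreter.load(model_directory)
print(interpreter.parse("اسلام علیکم عامر"))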
Okay, I'm glad it's working! I think 1.13.1 actually does work with 3.7; we're working on getting 3.7 support because of that (I don't think 1.13.0 did).
The Unicode error sounds strange; I believe we stopped supporting anything that isn't UTF-8 encoded. Would you mind posting the error?
Sorry for the late response. I think it is some issue with pycrfsuite.
The error I was getting:
"UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-4: ordinal not in range(128)"
The entire traceback is as follows:
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\site-packages\rasa_nlu\train.py", line 184, in <module>
    num_threads=cmdline_args.num_threads)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\site-packages\rasa_nlu\train.py", line 154, in do_train
    interpreter = trainer.train(training_data, **kwargs)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\site-packages\rasa_nlu\model.py", line 196, in train
    **context)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\site-packages\rasa_nlu\extractors\crf_entity_extractor.py", line 142, in train
    self._train_model(dataset)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\site-packages\rasa_nlu\extractors\crf_entity_extractor.py", line 551, in _train_model
    self.ent_tagger.fit(X_train, y_train)
File "C:\Users\hijab\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn_crfsuite\estimator.py", line 321, in fit
    trainer.append(xseq, yseq)
File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
File "stringsource", line 48, in vector.from_py.__pyx_convert_vector_from_py_std_3a__3a_string
File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-4: ordinal not in range(128)
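My guess is that the container was running with a C/ASCII locale, which would explain why the encode step fell back to the ascii codec; a quick diagnostic inside the container is something like this:

import locale
import sys

# If the preferred encoding reports an ASCII/ANSI locale (e.g. "ANSI_X3.4-1968"),
# the container has no UTF-8 locale configured and non-ASCII strings can end up
# being pushed through the 'ascii' codec.
print(sys.getdefaultencoding())
print(locale.getpreferredencoding())
print(sys.stdout.encoding)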