Hello all,
Does anyone by any chance have any experience with Rasa for Hebrew?
Thanks
Nir
Hi Nir,
I can’t say that I know of any examples, but I have been working on Rasa-compatible tools for non-English languages. Here’s a small list of tools/topics with regard to machine learning:
- Correct me if I am wrong, but I think Hebrew text can be split into tokens by a whitespace tokenizer? If so, the standard pipeline that we give you via rasa init will already work for you.
- I maintain a library called rasa-nlu-examples that tries to support many Non-English tools. For example, it supports FastText and Byte-Pair Embeddings, both offer pre-trained embeddings for Hebrew.
- There are multi-language BERT embeddings that are supported via our LanguageModelFeaturizer. In particular, we got good feedback on LaBSE. I’m not aware of any deployment for Hebrew, but it’s a tool that might help.
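As a rough illustration of the whitespace point, here is a minimal sketch in plain Python. It only imitates the basic idea of splitting on whitespace (Rasa's WhitespaceTokenizer does more than this), and the example sentence is made up for illustration rather than taken from a real assistant.

```python
# Hypothetical example utterance: "I want to book a table at the restaurant".
utterance = "אני רוצה להזמין שולחן במסעדה"

# str.split() with no arguments splits on any run of whitespace.
tokens = utterance.split()

for position, token in enumerate(tokens):
    print(position, token)
```

Each Hebrew word comes out as one token, so whitespace splitting is a workable starting point. Note that the last token keeps its attached preposition prefix, which is one reason subword features can be useful later on.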
Having mentioned these tools though, I would like to stress that the most important part of an assistant is the data, not the pipeline. I might focus on getting something demo-able first so that you can start collecting feedback from users. I do not speak Hebrew, but I can imagine that DIET and some simple CountVectors will go a long way when you’re starting out.
If there are more specific issues that you’re concerned with, let me know! I’m working on educational material this quarter on Non-English and if there are any specific blockers for you I’d love to understand that some more. I’m also interested in hearing if there are tools that I should add to Rasa-NLU-Examples for Hebrew.
Dear Vincent,
Thank you so much for the detailed response. I will definitely look into FastText, Byte-Pair Embeddings and the other resources you mention.
Nir
If you’re interested in playing around with these embeddings, you might enjoy whatlies. It’s a tool I made to explore pre-trained embeddings from a Jupyter notebook. There’s an example here with Arabic benchmarks and a YouTube video here where it is used for bulk labelling (which might help if you’re getting started).
Dear Vincent, Thank you so much for all your help. So far I have been able, using your guidance, to use the FastText embeddings (using your wonderful experimental library). I will definitely also check out whatlies later, it’s hard to keep up with all your great advice!
The trouble is that I am currently stuck using Rasa X in local mode. (I have no problem with Rasa in command line mode.) The problems I have are inconsistent and seemingly random. When I click on “train”, I get an error about 50% of the time, while it works fine the rest of the time. When I try to converse with the bot, I also get problems around 50% of the time. I think the problem is related to the Rasa server connectivity, but I have been unable to identify what it is, even when running with --debug. I would be grateful if you could help me or route this to someone who can. Here are the basic environment details:
Rasa Version : 2.2.2
Rasa SDK Version : 2.2.0
Rasa X Version : 0.34.0
Python Version : 3.8.0
Operating System : Linux-5.4.0-60-generic-x86_64-with-glibc2.27
Python Path : /home/nailon/rasa_x_test/venv/bin/python3.8
Could you share your config.yml file? Also maybe a traceback of any errors/training messages? Are you able to run rasa train from the command line to see if that works?
rasa train works fine. Never had a problem there. Here is config.yml
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
# https://rasahq.github.io/rasa-nlu-examples/docs/featurizer/fasttext/
language: he
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: rasa_nlu_examples.featurizers.dense.FastTextFeaturizer
    cache_dir: vecs
    file: cc.he.300.bin
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
#   - name: MemoizationPolicy
#   - name: TEDPolicy
#     max_history: 5
#     epochs: 100
#   - name: RulePolicy
This is a strange suggestion, but could you try to change the language to "en"? Just want to confirm that that isn’t the issue. If the issue persists it might be best to start a new thread such that this thread remains on the topic of Hebrew. I can also ping colleagues who are working on Rasa X to have a look on this new thread.
One question: you’ve added a CountVectorsFeaturizer that grabs subwords. Is there a reason why you didn’t add one for entire words?
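For context on the subword question: with analyzer: char_wb, the featurizer counts character n-grams inside word boundaries rather than whole words. The sketch below imitates that behaviour in plain Python so it runs without Rasa or scikit-learn. The helper char_wb_ngrams and the two example words are hypothetical, and the function only approximates what the real implementation does.

```python
from collections import Counter


def char_wb_ngrams(text, min_n=1, max_n=4):
    """Approximate character n-grams taken inside word boundaries."""
    grams = []
    for word in text.split():
        padded = f" {word} "  # pad with spaces so n-grams can mark word edges
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


# Hypothetical pair: "restaurant" and "at the restaurant" (prefix attached).
plain = "מסעדה"
prefixed = "במסעדה"

# Whole-word features: the two surface forms are different tokens.
shared_words = set(plain.split()) & set(prefixed.split())

# Subword features: count the character n-grams both forms have in common.
shared_chars = Counter(char_wb_ngrams(plain)) & Counter(char_wb_ngrams(prefixed))

print(len(shared_words), "shared whole words")
print(sum(shared_chars.values()), "shared character n-grams")
```

In this toy comparison the two forms share no whole-word token but do share several character n-grams, which is one reason subword counts can help with prefixed Hebrew words. The default pipeline suggested by rasa init typically lists a word-level CountVectorsFeaturizer alongside the char_wb one, so keeping both is a common choice.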
I will try switching to “en” and update.
Re the count vectorizer: the truth is I just copied it from somewhere; I didn’t know I was grabbing subwords.
Nir
I did a clean setup using Python 3.7 with the exact same configuration files, and this time things have been working smoothly (so far…).
Note that earlier I used Python 3.8 and, due to problems with installation, I had to downgrade Rasa to version 2.2.2, based on advice I saw in this forum. With Python 3.7, all these problems seem to have disappeared.
Are issues with Python 3.8 known, or is it just my experience?
I would need to check GitHub to know for sure. I’ve been working on python3.7 since I started working.
Dear Vincent,
I started working with Rasa using Docker, but I am quite new to Docker and a bit confused about how to include the required pip dependencies in the Rasa Docker image.
More specifically, I would like to add the following Python libraries (which worked nicely for me in local mode):
pip install fasttext
pip install git+https://github.com/RasaHQ/rasa-nlu-examples
Do I need to create a new rasa docker image for that, by extending the rasa/rasa docker image? Or is there a better way for that?
Thanks
Nir
Never mind, I figured it out, thanks