Help with Rasa for Hebrew

Hello all, Does anyone by any chance have any experience with Rasa for Hebrew? Thanks :slight_smile: Nir

Hi Nir,

I can’t say that I know of any examples, but I have been working on Rasa-compatible tools for non-English languages. Here’s a small list of tools/topics with regard to machine learning:

  • Correct me if I am wrong, but I think Hebrew text can be split into tokens by a whitespace tokenizer? If so, the standard pipeline that we give you via rasa init will already work for you.
  • I maintain a library called rasa-nlu-examples that tries to support many non-English tools. For example, it supports FastText and Byte-Pair Embeddings, both of which offer pre-trained embeddings for Hebrew.
  • There are multi-language BERT embeddings that are supported via our LanguageModelFeaturizer. In particular, we got good feedback on LaBSE. I’m not aware of any deployment for Hebrew, but it’s a tool that might help.
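To give an impression of what that last option looks like, here is a minimal config sketch using LaBSE weights via the LanguageModelFeaturizer (assuming a Rasa 2.x pipeline and the `rasa/LaBSE` checkpoint on Hugging Face; adjust epochs and components to taste):

```yaml
language: he

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "rasa/LaBSE"
  - name: DIETClassifier
    epochs: 100
```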

Having mentioned these tools though, I would like to stress that the most important part of an assistant is the data, not the pipeline. I might focus on getting something demo-able first so that you can start collecting feedback from users. I do not speak Hebrew, but I can imagine that DIET and some simple CountVectors will go a long way when you’re starting out.

If there are more specific issues that you’re concerned with, let me know! I’m working on educational material this quarter on Non-English and if there are any specific blockers for you I’d love to understand that some more. I’m also interested in hearing if there are tools that I should add to Rasa-NLU-Examples for Hebrew.

Dear Vincent,

Thank you so much for the detailed response. I will definitely look into FastText, Byte-Pair Embeddings, and the other resources you mentioned.

Nir

If you’re interested in exploring these embeddings, you might enjoy playing around with whatlies. It’s a tool I made to explore pre-trained embeddings from a Jupyter notebook. There’s an example here with Arabic benchmarks and a YouTube video here where it is used for bulk labelling (which might help if you’re getting started).

Dear Vincent, Thank you so much for all your help. So far I have been able, using your guidance, to use the FastText embeddings (using your wonderful experimental library). I will definitely also check out whatlies later, it’s hard to keep up with all your great advice!

Trouble is that I am currently stuck using Rasa X in local mode. (I have no problem with Rasa in command-line mode.) The problems I have are inconsistent and random: when I click on “train” I get an error about 50% of the time, while it works fine the other 50%. When I try to converse with the bot, I also get problems around 50% of the time. I think the problem is related to Rasa server connectivity, but I have been unable to identify its cause, even when running with --debug. I would be happy if you could help me or route this to someone who can. Here are the basic environment details:

Rasa Version: 2.2.2
Rasa SDK Version: 2.2.0
Rasa X Version: 0.34.0
Python Version: 3.8.0
Operating System: Linux-5.4.0-60-generic-x86_64-with-glibc2.27
Python Path: /home/nailon/rasa_x_test/venv/bin/python3.8

Could you share your config.yml file? Also maybe a traceback of any errors/training messages? Are you able to run rasa train from the command line to see if that works?

rasa train works fine. Never had a problem there. Here is config.yml

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/

# https://rasahq.github.io/rasa-nlu-examples/docs/featurizer/fasttext/
language: he

pipeline:
   - name: WhitespaceTokenizer
   - name: CountVectorsFeaturizer
     analyzer: char_wb
     min_ngram: 1
     max_ngram: 4
   - name: rasa_nlu_examples.featurizers.dense.FastTextFeaturizer
     cache_dir: vecs
     file: cc.he.300.bin    
   - name: DIETClassifier
     epochs: 100
   - name: EntitySynonymMapper
   - name: ResponseSelector
     epochs: 100
   - name: FallbackClassifier
     threshold: 0.3
     ambiguity_threshold: 0.1

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
#   - name: MemoizationPolicy
#   - name: TEDPolicy
#     max_history: 5
#     epochs: 100
#   - name: RulePolicy

This is a strange suggestion, but could you try to change the language to "en"? Just want to confirm that that isn’t the issue. If the issue persists it might be best to start a new thread such that this thread remains on the topic of Hebrew. I can also ping colleagues who are working on Rasa X to have a look on this new thread.

One question: you’ve added a CountVectorsFeaturizer that grabs subwords. Is there a reason why you didn’t also add one for entire words?

I will try switching to “en” and update you.

Re the CountVectorsFeaturizer: truth is, I just copied it from somewhere; I didn’t know I was grabbing subwords.

Nir

I did a clean setup using Python 3.7 with the exact same configuration files, and this time things have been working smoothly (so far…).

Note that earlier I used Python 3.8 and, due to installation problems, had to downgrade to Rasa 2.2.2 based on advice I saw in this forum. But Python 3.7 seemed to make all these problems disappear.

Are issues with Python 3.8 known, or is it just my experience?

I would need to check GitHub to know for sure. I’ve been using Python 3.7 since I started.

Dear Vincent,

I started working with Rasa using Docker, but I am quite new to Docker and a bit confused about how to include the required pip dependencies in the Rasa Docker image.

More specifically, I would like to add the following Python libraries (which worked nicely for me in local mode):

pip install fasttext
pip install git+https://github.com/RasaHQ/rasa-nlu-examples

Do I need to create a new Docker image for that, by extending the rasa/rasa image? Or is there a better way?

Thanks Nir

Never mind, I figured it out, thanks
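For anyone landing here with the same question: extending the official image is the usual route. A minimal Dockerfile sketch (the 2.2.2 tag matches the version mentioned above; the switch to root and back to user 1001 follows Rasa’s Docker customization docs):

```dockerfile
# Extend the official Rasa image with extra pip dependencies.
FROM rasa/rasa:2.2.2

# The base image runs as a non-root user; become root to install packages.
USER root
RUN pip install --no-cache-dir fasttext \
    git+https://github.com/RasaHQ/rasa-nlu-examples

# Drop back to the image's default non-root user.
USER 1001
```

Build it with something like `docker build -t rasa-hebrew .` and use the resulting tag wherever you previously used rasa/rasa.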