Challenges with Extracting Arabic Entities in Rasa for a Multi-Intent Chatbot

ashaaOufi · April 20, 2025, 11:02am

Hi Rasa Community,

I’m currently working on an Arabic-language chatbot designed to assist students with academic inquiries. The chatbot is intended to handle more than 100 intents covering topics such as course details, academic guidance, career advice, and graduation requirements, among others. I’ve made significant progress on the bot overall, but entity extraction for Arabic text is not working reliably.

The Problem: My chatbot uses custom entities (e.g., course_name) to identify key pieces of information in user queries. For example:

User input: “Tell me about Physics”
Expected entity extraction: course_name = Physics

However, the NLU model often fails to extract these entities correctly when users phrase their questions in natural or informal Arabic. For instance:

Input: “حكيلي عن الكيمياء” (Tell me about Chemistry) → Fails to extract course_name.
Input: “شو بتحكي مادة الكيميا” (What does the subject Chemistry cover?) → Also fails to extract course_name.

This issue is particularly problematic because the chatbot relies on these entities to provide accurate responses. Without proper entity extraction, the bot defaults to fallback responses like:

“Please specify the name of the course you’re asking about.”

What I’ve Tried So Far:

Training Data: I’ve added a variety of examples for each intent, including synonyms and informal phrasing, in my nlu.yml file. For example:

- intent: course_inquiry
  examples: |
    - حكيلي عن [الفيزياء](course_name)
    - شو بتحكي [مادة الكيمياء](course_name)
    - وين أجيب مصادر لمادة [الرياضيات](course_name)

I’ve also used lookup tables to include a comprehensive list of course names:

- lookup: course_name
  examples: |
    - فيزياء عامة
    - كيمياء
    - رياضيات
    - برمجة
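
To illustrate why the lookup table may not fire on these inputs, here is a minimal sketch. It assumes lookup entries are matched as whole words, which I believe is the default for Rasa's regex-based components; `build_lookup_pattern` is a hypothetical helper written for this post, not a Rasa function.

```python
import re

# Entries copied from the lookup table above.
LOOKUP_ENTRIES = ["فيزياء عامة", "كيمياء", "رياضيات", "برمجة"]


def build_lookup_pattern(entries):
    """Join entries into one whole-word regex (hypothetical helper, not Rasa's API)."""
    escaped = (re.escape(entry) for entry in entries)
    return re.compile("|".join(rf"\b{e}\b" for e in escaped))


pattern = build_lookup_pattern(LOOKUP_ENTRIES)

# Bare form: identical to a lookup entry, so it matches.
bare = pattern.search("حكيلي عن كيمياء")
# Definite article "ال" attached: there is no word boundary before "كيمياء".
with_article = pattern.search("حكيلي عن الكيمياء")
# Informal spelling without the final hamza: not present in the table at all.
informal = pattern.search("شو بتحكي مادة الكيميا")

print(bool(bare), bool(with_article), bool(informal))  # True False False
```

Under that assumption, only the bare dictionary form matches, so the table would need the prefixed and informal spellings listed as well.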

I’ve defined synonyms for common variations of course names, such as:

- synonym: chemistry_general
  examples: |
    - كيمياء
    - الكيمياء
    - الكيميا
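
As far as I understand, synonyms are applied only after an entity has already been extracted, mapping the extracted text to a canonical value. A simplified sketch of that behavior follows; the `apply_synonyms` function and the entity dictionary layout are illustrative assumptions, not Rasa internals.

```python
# Synonym table mirroring the nlu.yml entry above: surface form -> canonical value.
SYNONYMS = {
    "كيمياء": "chemistry_general",
    "الكيمياء": "chemistry_general",
    "الكيميا": "chemistry_general",
}


def apply_synonyms(entities, synonyms=SYNONYMS):
    """Replace each extracted entity value with its canonical form, if one exists."""
    return [
        {**entity, "value": synonyms.get(entity["value"], entity["value"])}
        for entity in entities
    ]


# Extraction succeeded: the informal spelling is mapped to the canonical value.
extracted = [{"entity": "course_name", "value": "الكيميا"}]
print(apply_synonyms(extracted))

# Extraction failed: there is nothing for the synonym step to map.
print(apply_synonyms([]))
```

If that reading is right, the synonym entries cannot rescue an input where extraction itself failed.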

I’ve implemented custom actions to handle cases where entities are missing, prompting users for clarification. For example:
```
# Entity is missing: ask the user to clarify and stop here.
if not query:
    dispatcher.utter_message(response="utter_advise_clarify_type")
    return []
```
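
For context, here is a self-contained sketch of how that check behaves. `StubDispatcher` and `handle_course_query` are hypothetical stand-ins so the snippet runs without a Rasa server; in the real action the dispatcher is supplied by `rasa_sdk`, and how `query` is obtained is an assumption here.

```python
class StubDispatcher:
    """Hypothetical stand-in that records messages instead of sending them."""

    def __init__(self):
        self.messages = []

    def utter_message(self, text=None, response=None):
        self.messages.append({"text": text, "response": response})


def handle_course_query(dispatcher, query):
    """Ask for clarification when no course name was extracted."""
    if not query:
        dispatcher.utter_message(response="utter_advise_clarify_type")
        return []
    # Placeholder for the real lookup logic.
    dispatcher.utter_message(text=f"Looking up: {query}")
    return []


dispatcher = StubDispatcher()
handle_course_query(dispatcher, None)        # extraction failed
handle_course_query(dispatcher, "الكيمياء")  # extraction succeeded
print(dispatcher.messages)
```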

Despite these efforts, the model still struggles with entity extraction, especially for informal or slightly misspelled Arabic inputs.
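
One workaround I'm considering is to normalize the text and fuzzy-match it against the course list before falling back. A rough sketch using only the Python standard library; the normalization rules, the `COURSES` list, and the 0.8 cutoff are my own untuned assumptions, and it only compares single tokens, so multi-word names like "فيزياء عامة" would need extra handling.

```python
import difflib
import re

COURSES = ["فيزياء عامة", "كيمياء", "رياضيات", "برمجة"]

_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel


def normalize(token):
    """Apply light, lossy normalization to one Arabic token."""
    token = _DIACRITICS.sub("", token)
    token = re.sub("[أإآ]", "ا", token)  # unify alef variants
    if token.startswith("ال") and len(token) > 3:
        token = token[2:]  # drop the definite article
    return token.rstrip("ء")  # final hamza is often dropped informally


def match_course(text, cutoff=0.8):
    """Return the first course whose normalized form is close to a token in text."""
    normalized_courses = {normalize(c): c for c in COURSES}
    for token in text.split():
        hits = difflib.get_close_matches(
            normalize(token), list(normalized_courses), n=1, cutoff=cutoff
        )
        if hits:
            return normalized_courses[hits[0]]
    return None


print(match_course("حكيلي عن الكيمياء"))
print(match_course("شو بتحكي مادة الكيميا"))
```

Both inputs from the failing examples resolve to the "كيمياء" entry here. In the custom action, something like this could run on the raw message text whenever course_name comes back empty.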

Are there specific techniques or tools in Rasa that work better for Arabic NLU?
How can I improve entity extraction for informal Arabic phrases or variations of words?
Would using pre-trained language models (e.g., BERT, AraBERT) help with entity recognition in Arabic? If so, how can I integrate them into my Rasa pipeline?
Any suggestions for optimizing the NLU model when dealing with a large number of intents (100+)?

I’d greatly appreciate any advice, tips, or resources from the community to help resolve this issue. Thank you in advance for your support!

Best regards,
