Challenges with Extracting Arabic Entities in Rasa for a Multi-Intent Chatbot

Hi Rasa Community,

I’m currently working on an Arabic-language chatbot designed to assist students with academic inquiries. It is intended to handle over 100 intents, covering topics like course details, academic guidance, career advice, graduation requirements, and more. While I’ve made significant progress in developing the bot, I’m running into problems with entity extraction for Arabic text.

The Problem: My chatbot uses custom entities (e.g., course_name) to identify key pieces of information in user queries. For example:

  • User input: “Tell me about Physics”
  • Expected entity extraction: course_name = Physics

However, the NLU model often fails to extract these entities correctly when users phrase their questions in natural or informal Arabic. For instance:

  • Input: “حكيلي عن الكيمياء” (Tell me about Chemistry) → Fails to extract course_name.
  • Input: “شو بتحكي مادة الكيميا” (What does the subject Chemistry cover?) → Also fails to extract course_name.

This issue is particularly problematic because the chatbot relies on these entities to provide accurate responses. Without proper entity extraction, the bot defaults to fallback responses like:

“Please specify the name of the course you’re asking about.”

What I’ve Tried So Far:

  • Training data: I’ve added a variety of examples for each intent, including synonyms and informal phrasing, in my nlu.yml file. For example:

    - intent: course_inquiry
      examples: |
        - حكيلي عن [الفيزياء](course_name)
        - شو بتحكي [مادة الكيمياء](course_name)
        - وين أجيب مصادر لمادة [الرياضيات](course_name)
    
  • I’ve also used lookup tables to include a comprehensive list of course names:

    - lookup: course_name
      examples: |
        - فيزياء عامة
        - كيمياء
        - رياضيات
        - برمجة
    
  • I’ve defined synonyms for common variations of course names, such as:

    - synonym: chemistry_general
      examples: |
        - كيمياء
        - الكيمياء
        - الكيميا
    
  • I’ve implemented custom actions to handle cases where entities are missing, prompting users for clarification. For example:

    if not query:
        dispatcher.utter_message(response="utter_advise_clarify_type")
        return []
    

Despite these efforts, the model still struggles with entity extraction, especially for informal or slightly misspelled Arabic inputs.
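
To make the mismatch concrete, here is a minimal standalone Python sketch (not part of my bot, and not Rasa code). It assumes lookup entries are matched as whole words, which I believe is how the RegexFeaturizer behaves by default, and shows that the entry "كيمياء" does not match the user's "الكيمياء" because of the definite article. The `normalize` and `match_course` helpers are hypothetical names I made up for illustration. They strip diacritics and a leading "ال", then fuzzy-match each token against the lookup list with `difflib`.

```python
import re
from difflib import get_close_matches

# Course names copied from the lookup table above.
LOOKUP = ["فيزياء عامة", "كيمياء", "رياضيات", "برمجة"]

# Tashkeel (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
ALEF_VARIANTS = re.compile("[أإآ]")


def whole_word_hit(entry: str, text: str) -> bool:
    """Return True if `entry` occurs in `text` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(entry) + r"\b")
    return pattern.search(text) is not None


def normalize(token: str) -> str:
    """Hypothetical normalizer: drop diacritics, unify alef forms, strip a leading 'ال'."""
    token = DIACRITICS.sub("", token)
    token = ALEF_VARIANTS.sub("ا", token)
    # Only strip the article when a reasonably long stem remains.
    if token.startswith("ال") and len(token) > 4:
        token = token[2:]
    return token


def match_course(text: str, cutoff: float = 0.8):
    """Return the first lookup entry that closely matches any token, else None.

    Limitation: tokens are compared one at a time, so multi-word
    entries such as "فيزياء عامة" are not handled by this sketch.
    """
    for token in text.split():
        hits = get_close_matches(normalize(token), LOOKUP, n=1, cutoff=cutoff)
        if hits:
            return hits[0]
    return None


if __name__ == "__main__":
    # The raw lookup entry misses the form with the definite article.
    print(whole_word_hit("كيمياء", "حكيلي عن الكيمياء"))
    # After normalization, both the standard and the informal spelling resolve.
    print(match_course("حكيلي عن الكيمياء"))
    print(match_course("شو بتحكي مادة الكيميا"))
```

In this sketch the whole-word check returns False for "حكيلي عن الكيمياء", while `match_course` resolves both example inputs to "كيمياء". The 0.8 cutoff is an arbitrary value chosen for the example. I haven't verified whether the same normalization would help if applied to the training data and user messages before they reach the NLU pipeline.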

My Questions:

  • Are there specific techniques or tools in Rasa that work better for Arabic NLU?
  • How can I improve entity extraction for informal Arabic phrases or variations of words?
  • Would using pre-trained language models (e.g., BERT, AraBERT) help with entity recognition in Arabic? If so, how can I integrate them into my Rasa pipeline?
  • Any suggestions for optimizing the NLU model when dealing with a large number of intents (100+)?

I’d greatly appreciate any advice, tips, or resources from the community to help resolve this issue. Thank you in advance for your support!

Best regards,