Hugging Face LLM instead of OpenAI

Hi,

I’m new to Rasa Pro and I want to use a Hugging Face LLM instead of OpenAI.

I’m getting this error while running the bot:

C:\Users\chandrasekhar.m\AppData\Local\anaconda3\envs\plus\lib\site-packages\huggingface_hub\utils\_deprecation.py:131: FutureWarning: 'InferenceApi' (from 'huggingface_hub.inference_api') is deprecated and will be removed from version '1.0'. InferenceApi client is deprecated in favor of the more feature-complete InferenceClient. Check out this guide to learn how to convert your script to use it: Run Inference on servers.
  warnings.warn(warning_message, FutureWarning)
2024-03-27 14:29:49 ERROR rasa.utils.log_utils - [error ] llm_command_generator.llm.error error=ValueError('Error raised by inference API: Request failed during generation: Server error: Out of available cache blocks: asked 52, only 11 free blocks')
C:\Users\chandrasekhar.m\AppData\Local\anaconda3\envs\plus\lib\site-packages\sanic\server\websockets\impl.py:521: DeprecationWarning: The explicit passing of coroutine objects to asyncio.wait() is deprecated since Python 3.8, and scheduled for removal in Python 3.11.
  done, pending = await asyncio.wait(
C:\Users\chandrasekhar.m\AppData\Local\anaconda3\envs\plus\lib\site-packages\huggingface_hub\utils\_deprecation.py:131: FutureWarning: 'InferenceApi' (from 'huggingface_hub.inference_api') is deprecated and will be removed from version '1.0'. InferenceApi client is deprecated in favor of the more feature-complete InferenceClient. Check out this guide to learn how to convert your script to use it: Run Inference on servers.
  warnings.warn(warning_message, FutureWarning)
2024-03-27 14:31:56 ERROR rasa.utils.log_utils - [error ] llm_command_generator.llm.error error=ValueError('Error raised by inference API: Request failed during generation: Server error: Out of available cache blocks: asked 54, only 8 free blocks')
C:\Users\chandrasekhar.m\AppData\Local\anaconda3\envs\plus\lib\site-packages\sanic\server\websockets\impl.py:521: DeprecationWarning: The explicit passing of coroutine objects to asyncio.wait() is deprecated since Python 3.8, and scheduled for removal in Python 3.11.
  done, pending = await asyncio.wait(

Please help me understand how to set this up.

Hello @sekhar8,

Can you share your config.yml with us, please?

Sure, here it is:

recipe: default.v1
language: en
pipeline:
- name: LLMCommandGenerator
  llm:
    type: "huggingface_hub"
    repo_id: "HuggingFaceH4/zephyr-7b-beta"
    task: "text-generation"

  # model_name: gpt-4
policies:
- name: FlowPolicy
#  - name: rasa_plus.ml.DocsearchPolicy
#  - name: RulePolicy
assistant_id: 20240327-144556-amortized-beam
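
As a note on the error in the log above: "Out of available cache blocks" is raised by the hosted Hugging Face inference endpoint running out of room during generation, not by Rasa itself. One thing that may help is capping the generation length. Below is a minimal sketch, assuming Rasa forwards the extra keys under llm to LangChain's HuggingFaceHub wrapper; the model_kwargs block and the max_new_tokens value are illustrative assumptions, not confirmed Rasa options:

pipeline:
- name: LLMCommandGenerator
  llm:
    type: "huggingface_hub"
    repo_id: "HuggingFaceH4/zephyr-7b-beta"
    task: "text-generation"
    # assumed to be passed through to the Hugging Face endpoint;
    # keeps each generation short so the server's cache is not exhausted
    model_kwargs:
      max_new_tokens: 256
      temperature: 0.0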

Please also suggest some good models for rephrasing sentences.
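
For rephrasing, Rasa Pro's contextual response rephraser is configured in endpoints.yml, and its llm block can point at a Hugging Face model in the same way as the command generator. A minimal sketch, assuming the rephraser accepts the same huggingface_hub llm settings; the repo_id is simply the model already used above, not a specific recommendation:

nlg:
  type: rephrase
  llm:
    type: "huggingface_hub"
    repo_id: "HuggingFaceH4/zephyr-7b-beta"
    task: "text-generation"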

When I try it, it gives me the following:

 Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors..
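
Note that this RateLimitError comes from langchain.embeddings.openai, i.e. an embeddings call, not from the LLM itself: even with the command generator pointed at Hugging Face, a component that computes embeddings can still default to OpenAI. Below is a minimal sketch of switching the embeddings provider as well, assuming your Rasa Pro version exposes an embeddings block for the component in question (flow_retrieval shown here exists only in newer versions, and the type/model_name values are assumptions based on LangChain's HuggingFaceEmbeddings):

pipeline:
- name: LLMCommandGenerator
  llm:
    type: "huggingface_hub"
    repo_id: "HuggingFaceH4/zephyr-7b-beta"
    task: "text-generation"
  # assumed: route embedding calls to a local sentence-transformers model
  # instead of the default OpenAI embeddings
  flow_retrieval:
    embeddings:
      type: "huggingface"
      model_name: "sentence-transformers/all-MiniLM-L6-v2"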

I am also getting the same error, even after switching to llama.cpp, once the bot has generated 2 or 3 responses. Can anyone please help?

I think the LLM responds in a format that Rasa can’t parse, so you may need to use a custom prompt template.
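
A minimal sketch of wiring in a custom prompt, assuming the prompt key of LLMCommandGenerator points the component at your own Jinja2 template file (the file path here is just an example name):

pipeline:
- name: LLMCommandGenerator
  # assumed: override the built-in prompt with a template tuned for the
  # Hugging Face model's output format
  prompt: prompts/command-generator.jinja2
  llm:
    type: "huggingface_hub"
    repo_id: "HuggingFaceH4/zephyr-7b-beta"
    task: "text-generation"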