I’m working on a Rasa CALM assistant and want to leverage my GPU for faster inference. Ideally I’d like to use GPTQ models, but as I understand it, llama-cpp-python currently only supports the GGUF format.
Does anyone have suggestions for running GPTQ models with Rasa CALM on the GPU?