[New Feature] LLMs for Machine Translation of slot-annotated data

Expanding SLU to new languages requires a lot of manual data annotation. To significantly reduce that effort, LLMs can be used to machine-translate English slot-annotated data, e.g.

"play me <a> Dune <a> on <b> Youtube <b>" => "Spiele mir <a> Dune <a> auf <b> Youtube <b>"
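The markup above is simple enough to handle with a regex. As a minimal illustration (the tag names `<a>`/`<b>` follow the example above; the helper names are mine, not from the released code), here is how slot values can be extracted and how one can check that a translation preserved the tags so the annotation projects onto the target language:

```python
import re

# Slot spans are delimited by paired tags, e.g.
# "play me <a> Dune <a> on <b> Youtube <b>".
SLOT_RE = re.compile(r"<(\w+)>\s*(.*?)\s*<\1>")

def extract_slots(annotated: str) -> dict:
    """Return {tag: value} for every slot span in an annotated utterance."""
    return {tag: value for tag, value in SLOT_RE.findall(annotated)}

def tags_preserved(source: str, translation: str) -> bool:
    """True if the translation kept exactly the same slot tags as the
    source, so the slot annotation can be carried over."""
    src_tags = sorted(tag for tag, _ in SLOT_RE.findall(source))
    tgt_tags = sorted(tag for tag, _ in SLOT_RE.findall(translation))
    return src_tags == tgt_tags
```

For the pair above, `extract_slots` yields `{"a": "Dune", "b": "Youtube"}` on both sides and `tags_preserved` holds, so the German sentence can be used directly as annotated training data.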

In our recent work, we fine-tuned an MT LLM called BigTranslate for MT of slot-annotated NLU data, using the parallel Amazon MASSIVE dataset for fine-tuning. On the MultiATIS++ benchmark, fine-tuning brings a significant performance improvement compared to zero-shot LLM-based machine translation, zero-shot mBERT, and other state-of-the-art approaches such as FC-MTLF.

The fine-tuned BigTranslate model is available on Hugging Face: Samsung/BigTranslateSlotTranslator. The fine-tuning code and the NLU training code are on GitHub: Samsung/MT-LLM-NLU (repository for our publication "LLM-Based Machine Translation for Expansion of Spoken Language Understanding Systems to New Languages").

In short, we would like to merge our pipeline into Rasa, but I don't know where to start, since Rasa doesn't have MT pipelines as of now. I created a Jira issue for this: [OSS-765] - Jira

You could write a custom component to do this.
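For anyone exploring that route, the core translation step could be wrapped roughly as below. This is a framework-agnostic sketch, not the published implementation: `translate_fn` stands in for the fine-tuned BigTranslate model (Samsung/BigTranslateSlotTranslator) so the control flow can be shown without loading the checkpoint, and to run inside Rasa 3.x the class would still need to implement the custom graph component API and be registered in the pipeline in `config.yml`.

```python
import re
from typing import Callable, List

# Paired slot tags, e.g. "play me <a> Dune <a> on <b> Youtube <b>".
SLOT_RE = re.compile(r"<(\w+)>")

class SlotAwareTranslator:
    """Wraps an MT callable and keeps only translations whose slot-tag
    sequence matches the source, so malformed outputs never reach
    NLU training.

    `translate_fn` is injected (here a plain callable) and would be
    backed by the fine-tuned BigTranslate model in practice.
    """

    def __init__(self, translate_fn: Callable[[str], str]):
        self.translate_fn = translate_fn

    def translate_corpus(self, examples: List[str]) -> List[str]:
        kept = []
        for source in examples:
            translation = self.translate_fn(source)
            # Drop translations where slot tags were lost or reordered.
            if SLOT_RE.findall(source) == SLOT_RE.findall(translation):
                kept.append(translation)
        return kept
```

The tag-consistency filter is the important design choice: LLM-based MT occasionally drops or mangles the inline tags, and silently keeping such outputs would corrupt the slot annotation of the generated training data.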