Hey @koaning, sorry for the late reply!
“Have you explored if spaCy is of help? You can configure Rasa to use the lemmas from spaCy as tokens as well.”
Yes, I have tried spaCy. Our use case is very domain-specific, and the pretrained embeddings didn’t work well. I then tried using only the tokens/lemmas from spaCy and feeding them to the CountVectorsFeaturizer, but that didn’t work well either, because spaCy’s lemmatization accuracy for Greek is only about 56%, which is very low.
One more reason typos are hard to handle in Greek is that several letters (and letter combinations) spell the same vowel sound. For example, the sound “i” can be written in Greek as “ι”, “υ”, “η”, or “ει”, and people very commonly mix them up.
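To make this concrete, one option is to collapse all of the /i/ spellings into a single symbol before featurizing. This is a minimal sketch of that idea; the helper names and mapping are my own illustration, not part of my actual pipeline:

```python
import unicodedata

# All Greek spellings that sound like "i"; the digraph "ει" must be
# replaced before the single letters so it is not split in two.
I_SPELLINGS = ["ει", "η", "ι", "υ"]

def strip_accents(text: str) -> str:
    """Drop accent marks so e.g. 'ή' and 'η' compare equal."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

def normalize_i_sounds(text: str) -> str:
    """Map every /i/ spelling to the single symbol 'i'."""
    text = strip_accents(text.lower())
    for spelling in I_SPELLINGS:
        text = text.replace(spelling, "i")
    return text
```

With this, a word like “ήλιος” misspelled with the wrong i-letter (“ύλιος”) collapses to the same surface form, so the featurizer sees them as identical.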
“I’ve been thinking about your Greek use-case and part of me is now wondering if it’s perhaps easier to solve the problem by paraphrasing. We could take the Greek text from the Greek alphabet and turn it into Latin. That means that technically, we could make two nlu.yml files. One for Latin and one for Greek. We could then have Rasa train on both.”
This is actually similar to what I have done now: I have transformed the training examples into a Latin representation.
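For reference, here is a minimal sketch of such a Greek-to-Latin (“Greeklish”) transliteration. The mapping table is my own illustration, not the exact one from my pipeline:

```python
import unicodedata

# Illustrative character mapping; choices like "η" -> "i" and
# "υ" -> "y" are debatable and are an assumption, not our real table.
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "ch", "ψ": "ps",
    "ω": "o",
}

def to_latin(text: str) -> str:
    """Transliterate Greek text into a Latin representation."""
    # strip accents first so accented vowels hit the mapping table
    stripped = "".join(
        c for c in unicodedata.normalize("NFD", text.lower())
        if unicodedata.category(c) != "Mn"
    )
    return "".join(GREEK_TO_LATIN.get(c, c) for c in stripped)
```

Running every example in nlu.yml through a function like this yields the second, Latin training file.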
“is your dataset publicly available? I’ve actually got a small set of tools that I’d love to try out.”
Unfortunately, we don’t have anything publicly available yet.
Quick question: at some point I used the byte-pair embeddings you suggested in the workshop, and they worked better than spaCy. However, they wouldn’t work with messages transformed into Latin characters, correct? Is there a way to train the byte-pair embeddings on a new dataset?
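For context on that last question: the merge table behind byte-pair embeddings can in principle be learned from any corpus, including Latin-transliterated text. Below is a minimal sketch of the classic BPE merge loop to show the idea; this is not the bpemb library API, and in practice a tool like sentencepiece (with `model_type="bpe"`) would do this at scale:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn a list of BPE merges from an iterable of words."""
    # represent each word as a tuple of symbols, weighted by frequency
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair across the vocabulary
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # rewrite every word with the best pair fused into one symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

Embeddings would then be trained on top of the resulting subword segmentation, which is the part that makes it specific to the new (e.g. transliterated) dataset.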
Thank you so much for your time