@kmegalokonomos Funny that you mention the BytePair embeddings. I’m working on a feature over at the nlu-examples repository that will let you pass the subtokens, rather than the embeddings, so that they can be sparsely encoded.
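To give you a rough idea of what I mean, here's a minimal sketch using the `bpemb` package and scikit-learn. This is just the concept, not the actual nlu-examples implementation:

```python
# Sketch only: use BPEmb as a subword tokenizer and sparse-encode
# the subtokens, instead of using its dense embeddings directly.
from bpemb import BPEmb
from sklearn.feature_extraction.text import CountVectorizer

bpemb_el = BPEmb(lang="el", vs=10000, dim=50)  # Greek BytePair model

texts = ["καλημέρα σας", "καλησπέρα"]
# .encode() returns the subword tokens rather than dense vectors
subtokens = [" ".join(bpemb_el.encode(t)) for t in texts]

# Sparse count features over the subword vocabulary
vectorizer = CountVectorizer(analyzer=lambda s: s.split())
sparse_features = vectorizer.fit_transform(subtokens)
print(sparse_features.shape)
```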
Technically, you could translate the Latin characters back into Greek and fetch the resulting embedding. This feels a bit experimental, but I suppose you could try it and see if it works.
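To make that concrete, here's a naive sketch. The mapping table is purely illustrative and far from complete; real Greeklish has ambiguous digraphs (e.g. "th" for θ) that a simple character map won't handle:

```python
# Illustrative only: a (very incomplete) Greeklish -> Greek character map.
GREEKLISH_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε",
    "k": "κ", "l": "λ", "m": "μ", "n": "ν", "o": "ο",
    "p": "π", "r": "ρ", "s": "σ", "t": "τ", "i": "ι",
}

def latin_to_greek(text: str) -> str:
    """Map Latin characters back to Greek before fetching embeddings."""
    return "".join(GREEKLISH_TO_GREEK.get(ch, ch) for ch in text.lower())

# Approximate: digraphs and accents are ignored in this toy version.
print(latin_to_greek("kalimera"))  # -> καλιμερα
# You could then feed the result to the Greek BPEmb model via .embed().
```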
Having said that, I wouldn’t worry about training your own embeddings too much. Assuming you’re using DIET, you’d automatically be training an internal representation for all of your (sub)tokens already. There’s an elaborate thread on this topic here.
This quarter I might start working on a command line tool with paraphrasing tricks. I think paraphrasa might be an awesome name for that project. Originally I wanted to include mainly spelling-related tricks, but I’ll likely add a Greek-Latin translator and a demo as well. Should the time come around, could I poke you @kmegalokonomos for a review?
Also, could you share anything about how effective the Latin translation trick is? Are you training only on the Latin characters, or on both the Latin and the Greek ones?