@kmegalokonomos Funny that you mention the BytePair embeddings. I’m working on a feature over at the nlu-examples repository that will let you pass the subtokens, rather than the embeddings, so that they can be sparsely encoded.
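To give you a rough idea of what I mean, here's a minimal sketch using the `bpemb` package and scikit-learn. This is just the concept, not the actual nlu-examples implementation:

```python
# Sketch only: use BPEmb as a subword tokenizer and sparse-encode
# the subtokens, instead of using its dense embeddings directly.
from bpemb import BPEmb
from sklearn.feature_extraction.text import CountVectorizer

bpemb_el = BPEmb(lang="el", vs=10000, dim=50)  # Greek BytePair model

texts = ["καλημέρα σας", "καλησπέρα"]
# .encode() returns the subword tokens rather than dense vectors
subtokens = [" ".join(bpemb_el.encode(t)) for t in texts]

# Sparse count features over the subword vocabulary
vectorizer = CountVectorizer(analyzer=lambda s: s.split())
sparse_features = vectorizer.fit_transform(subtokens)
print(sparse_features.shape)
```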
Technically, you could translate the Latin characters back into Greek and fetch the resulting embedding. This feels a bit experimental, but I suppose you could try it and see if it works.
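To make that concrete, here's a naive sketch. The mapping table is purely illustrative and far from complete; real Greeklish has ambiguous digraphs (e.g. "th" for θ) that a simple character map won't handle:

```python
# Illustrative only: a (very incomplete) Greeklish -> Greek character map.
GREEKLISH_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε",
    "k": "κ", "l": "λ", "m": "μ", "n": "ν", "o": "ο",
    "p": "π", "r": "ρ", "s": "σ", "t": "τ", "i": "ι",
}

def latin_to_greek(text: str) -> str:
    """Map Latin characters back to Greek before fetching embeddings."""
    return "".join(GREEKLISH_TO_GREEK.get(ch, ch) for ch in text.lower())

# Approximate: digraphs and accents are ignored in this toy version.
print(latin_to_greek("kalimera"))  # -> καλιμερα
# You could then feed the result to the Greek BPEmb model via .embed().
```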
Having said that, I wouldn’t worry about training your own embeddings too much. Assuming you’re using DIET, you’d automatically be training an internal representation for all of your (sub)tokens already. There’s an elaborate thread on this topic here.
This quarter I might start working on a command line tool with paraphrasing tricks. I think paraphrasa might be an awesome name for that project. Originally I wanted to include mainly spelling-related tricks, but I’ll likely add a Greek-Latin translator and a demo as well. Should the time come around, could I poke you @kmegalokonomos for a review?
Also, could you share anything about how effective the Latin translation trick is? Are you training only on the Latin characters, or on both the Latin and the Greek ones?