Phonetics Featurizer

Hello everyone, is it possible to add extra features to a featurizer (e.g. the CountVectorsFeaturizer)? I am trying to create a custom featurizer that adds a phonetic representation of words. Or do you have another suggestion on how to achieve this in the Rasa pipeline?

Kind Regards,

Hey @kmegalokonomos, have you looked at custom components?

@kmegalokonomos funny you mention this, because it’s something that I have added to my own toolbelt for scikit-learn here. It’s something that I wouldn’t mind making a component for, but I would like to understand the use case better before I put effort into it. Why do you think having these features would make a difference? If it’s spelling related, can you argue why these features might contribute something that the countvectors don’t? We should remember that phonetic heuristics aren’t perfectly consistent.

One thing to perhaps add: I added it to my toolbelt because I was curious about its merit. My lesson so far is that it doesn’t beat the character-based countvectorizer in any of my benchmarks.
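For context, the idea boils down to something like the sketch below. This is a rough stand-in rather than the actual toolbelt code, and it uses jellyfish’s Metaphone purely as an example phonetic algorithm; any phonetic encoder could be dropped in.

```python
import jellyfish
from sklearn.feature_extraction.text import CountVectorizer


def phonetic_analyzer(text):
    """Replace every whitespace-separated token by its Metaphone code."""
    return [jellyfish.metaphone(token) for token in text.split()]


# The phonetic codes become the "words" that the CountVectorizer counts,
# so tokens that sound alike collapse onto the same feature column.
vectorizer = CountVectorizer(analyzer=phonetic_analyzer)
X = vectorizer.fit_transform(["write right", "rite of passage"])
print(vectorizer.vocabulary_)  # "write", "right" and "rite" share one code
```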

(Hey Sams, yes, I am currently experimenting with custom components.)

Hey @koaning , first of all let me quickly say that I am working with the Greek language. Regarding the CountVectorsFeaturizer, we’re having some problems with the n-grams. Mainly we get misclassifications because one word is a substring of another word or they share the same origin, say “εισερχομενες” and “εξερχομενες”, which mean incoming and outgoing. This happens a lot in Greek. I guess an example in English could be “classif-ication” and “publ-ication”. Also, believe it or not, there is a “new” language in Greece called Greeklish: people write Greek with English characters because they can’t be bothered to switch keyboards (mostly young people). So one person would write “εισερχομενες” and another “eiserxomenes”. It is as if people are doing some sort of phonetic processing in their heads while writing. This would not work with simple n-grams. I have thought of two ways to tackle this:

  1. One would be to create a custom preprocessor (I already have one) that changes the message into its phonetics-based form and then calls message.set(text), so the next components in the pipeline see the changed message. This would also handle the intent examples during training. Alternatively, we could do similar processing on the tokens in the tokenizer, or do it inside the CountVectorsFeaturizer as part of the _process_message method. In the end we would still use countvectors, but based on the processed message (see the sketch after this list).

  2. The second would be to create some sort of PhoneticsFeaturizer that creates features (in addition to the CountVectors) that provide insight into how close a user utterance sounds to the training examples. (This is still not 100% clear in my head; I am still thinking about it.)
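For reference, here is a minimal sketch of what I mean by option 1, assuming the Rasa 2.x custom component API; greek_phonetics() is just a placeholder for our existing transcription code.

```python
from typing import Any, Optional, Text

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData


def greek_phonetics(text: Text) -> Text:
    """Placeholder for our existing Greek phonetic transcription code."""
    return text


class GreekPhoneticsPreprocessor(Component):
    """Rewrites the message text into its phonetic form so that downstream
    components (tokenizer, CountVectorsFeaturizer, ...) work on phonemes."""

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        # Rewrite the intent examples as well, so training matches inference.
        for example in training_data.training_examples:
            example.set(TEXT, greek_phonetics(example.get(TEXT)))

    def process(self, message: Message, **kwargs: Any) -> None:
        message.set(TEXT, greek_phonetics(message.get(TEXT)))
```

The component would sit at the top of the pipeline, before the tokenizer, so everything after it only ever sees the phonetic text.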

We already have the code to create a phonetic representation of a string in Greek. If we proceeded with solution 1 above, where would be the best place to apply it?

I honestly appreciate your input on this. Konstantinos

P.S. By the way, we really enjoyed your presentation at the Rasa Summit, thank you for that!


It’s not out just yet, but as of Rasa 2.5 we will roll out support for spaCy 3.0. I don’t know when support for Greek was added, but I do know that it is there now. You should already be able to download spaCy in a notebook and play around; there’s more info on the model here. Have you explored whether spaCy is of help? You can also configure Rasa to use the lemmas from spaCy as tokens.
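For example, something along these lines should work in a notebook (assuming the small Greek model, el_core_news_sm; the medium and large variants work the same way):

```python
# Requires: pip install -U spacy && python -m spacy download el_core_news_sm
import spacy

nlp = spacy.load("el_core_news_sm")
doc = nlp("εισερχομενες και εξερχομενες κλησεις")

# Inspect how the Greek model tokenises and lemmatises the text.
print([(token.text, token.lemma_) for token in doc])
```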

We already have the code to create a phonetic representation of a string in Greek. If we proceeded with solution 1 above, where would be the best place to apply it?

If you’ve got an open repo for it I’d love to have a peek. But what you’re suggesting in step 1 sounds reasonable to me. You want to end up with some sort of sparse vector and I think _process_message should do the trick. I am very curious to be kept in the loop though since many languages these days have a similar problem. Hindi is more like “Hinglish” when kids are typing on keyboards.
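To make your second option a bit more concrete, a standalone featurizer could look roughly like the untested sketch below. It assumes the Rasa 2.x SparseFeaturizer and Features API, and greek_phonetics() is again just a placeholder; character n-grams over the phonetic text are hashed into a fixed-size sparse sentence vector.

```python
from typing import Any, Optional, Text

import scipy.sparse
from sklearn.feature_extraction.text import HashingVectorizer

from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.featurizers.featurizer import SparseFeaturizer
from rasa.shared.nlu.constants import FEATURE_TYPE_SENTENCE, TEXT
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData


def greek_phonetics(text: Text) -> Text:
    """Placeholder for the Greek phonetic transcription."""
    return text


class PhoneticsFeaturizer(SparseFeaturizer):
    """Adds sentence-level sparse features computed on the phonetic form of
    the text, on top of whatever the CountVectorsFeaturizer produces."""

    def __init__(self, component_config: Optional[dict] = None) -> None:
        super().__init__(component_config)
        # Hashing means there is no fitted vocabulary to persist.
        self.vectorizer = HashingVectorizer(
            analyzer="char_wb", ngram_range=(2, 4), n_features=2048
        )

    def _set_phonetic_features(self, message: Message) -> None:
        text = message.get(TEXT)
        if not text:
            return
        matrix = self.vectorizer.transform([greek_phonetics(text)])
        message.add_features(
            Features(
                scipy.sparse.coo_matrix(matrix),
                FEATURE_TYPE_SENTENCE,
                TEXT,
                self.name,
            )
        )

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        for example in training_data.training_examples:
            self._set_phonetic_features(example)

    def process(self, message: Message, **kwargs: Any) -> None:
        self._set_phonetic_features(message)
```

In practice you may also want to add per-token sequence features (FEATURE_TYPE_SEQUENCE) next to the sentence features, but this should be enough to start benchmarking against the plain countvectors.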

+1 - I know many Indian languages have this thing where phonetics are used to type on QWERTY keyboards, ever since the internet became a thing in India. It is not just code-mixing; words are often typed out entirely by the sound they make, and the majority of the data you would find online to scrape is in that form, so there is a vast amount of online resources (internet data) for such languages. I tried the supervised embeddings once, a long time ago, but it quickly became clear that it didn’t work for everything.

I have noticed Facebook translating phonetically typed Hindi/Bengali into English a few times, but I can’t find any paper or code from them on how they are doing it.

@souvikg10 I wonder … do you know of a dataset with a large corpus of Hinglish texts? There is a pre-training trick on the tokens that might be worth a try: I might be able to pre-train a subword tokeniser that can directly feed a countvectorizer. I’ve open-sourced one such implementation for scikit-learn here, but the pretrained variant that I currently have is trained on Wikipedia.
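Roughly what I have in mind is the sketch below. It is only an illustration: the corpus path and vocabulary size are made up, and sentencepiece stands in for whatever subword trainer ends up being used.

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import CountVectorizer

# Train a BPE subword model on a (hypothetical) Hinglish corpus,
# one sentence per line in hinglish_corpus.txt.
spm.SentencePieceTrainer.train(
    input="hinglish_corpus.txt",
    model_prefix="hinglish_bpe",
    vocab_size=8000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="hinglish_bpe.model")

# The pretrained subword tokeniser becomes the analyzer of a CountVectorizer,
# so the downstream model counts subword units instead of raw words.
vectorizer = CountVectorizer(analyzer=lambda text: sp.encode(text, out_type=str))
```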

I saw something recently called the HOT dataset (Highly Offensive Tweets), where someone used the Twitter API to scrape tweets in Hinglish (with at least three Hindi words per tweet). I don’t know where that dataset went, but the closest thing I found when searching on GitHub is this.

Ah yeah. Though profanity datasets are good to study, I doubt they’ll result in subtokens ready to be applied in a common scenario.

If you want a great dataset, I think this trick would work: start listening to some trending Twitter topics or public Facebook pages with particular hashtags. In India the IPL (Indian Premier League) is starting tomorrow, so Twitter is going to be pretty busy :smiley: I am sure you will be able to collect a lot of Hinglish tweets that way. Another option is to pull the comments from the Facebook pages of popular celebrities in India; these are public pages, so you should be able to.