Hello everyone, is it possible to add extra features to a featurizer (e.g. the CountVectorsFeaturizer)? I am trying to create a custom featurizer that adds a phonetic representation of words. Or do you have some other suggestion on how to achieve this in the Rasa pipeline?
@kmegalokonomos funny you mention this, because it’s something that I have added to my own toolbelt for scikit-learn here. It’s something that I wouldn’t mind making a component for, but I would like to understand the use-case better before I put effort into it. Why do you think having these features would make a difference? If it’s spelling related, can you argue why these features might contribute something that the countvectors don’t? We should remember that the phonetic heuristics aren’t perfectly consistent.
One thing to perhaps add: I added it to my toolbelt because I was curious about the merit of adding it. My lesson so far is that it doesn’t beat the character-based CountVectorizer in any of my benchmarks.
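For reference, the rough shape of the idea in scikit-learn looks something like the sketch below. It uses jellyfish’s soundex as the phonetic heuristic for illustration; it’s not necessarily what my toolbelt does:

```python
import jellyfish
from sklearn.feature_extraction.text import CountVectorizer

def phonetic_analyzer(text):
    # Encode each token phonetically; soundex maps similar-sounding
    # words to the same code, e.g. "Robert" and "Rupert" -> "R163".
    return [jellyfish.soundex(tok) for tok in text.split()]

# Count phonetic codes instead of raw tokens or character n-grams.
vectorizer = CountVectorizer(analyzer=phonetic_analyzer)
X = vectorizer.fit_transform(["are you a robot", "ar yu a robot"])
```

The appeal is that two differently spelled but similar-sounding tokens can land in the same column of the sparse matrix.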
(Hey Sams, yes, I am currently experimenting with custom components.)
Hey @koaning, first of all let me quickly say that I am working with the Greek language. Regarding the CountVectorsFeaturizer, we’re having some problems with the n-grams. Mainly we get misclassifications because one word is a substring of another, or because two words have the same origin. Take “εισερχομενες” and “εξερχομενες”, which mean incoming and outgoing; this happens a lot in Greek. I guess an example in English could be “classif-ication” and “publ-ication”. Also, believe it or not, there is a “new” language in Greece called Greeklish: people write Greek with English characters because they can’t be bothered to switch keyboards (mostly young people). So one person would write “εισερχομενες” and another would write “eiserxomenes”. It is as if people are doing some sort of phonetic processing in their heads while writing. This would not work with simple n-grams. I have thought of two ways to tackle this:
The first would be to create a custom preprocessor (I already have one; sketched below) that changes the message to a phonetics-based message and then calls message.set(text), so that the changed message is what the next components in the pipeline see. This would also handle intent examples during training. Alternatively, we could do similar processing on the tokens in the tokenizer, or do the same processing inside the CountVectorsFeaturizer as part of the _process_message method. In the end we would still use countvectors, but based on the processed message.
The second would be to create some sort of PhoneticsFeaturizer that creates features (in addition to the CountVectors) that provide insight into how close a user utterance sounds to the training examples. (This is still not 100% clear in my head; I am still thinking it through.)
We already have the code to create a phonetic representation of a string in Greek. If we proceeded with solution 1 above, where would be the best place to apply it?
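To make solution 1 concrete, here is a minimal sketch of the kind of component I have in mind (this assumes the Rasa 2.x component API; to_phonetic stands in for our existing Greek phonetics helper):

```python
from typing import Any

from rasa.nlu.components import Component
from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData

from our_package.phonetics import to_phonetic  # hypothetical: our existing helper


class PhoneticPreprocessor(Component):
    """Rewrites the message text into its phonetic representation so that
    every downstream tokenizer/featurizer only sees the phonetic form."""

    def train(self, training_data: TrainingData, config=None, **kwargs: Any) -> None:
        # Rewrite every training example the same way as live messages.
        for example in training_data.training_examples:
            example.set(TEXT, to_phonetic(example.get(TEXT)))

    def process(self, message: Message, **kwargs: Any) -> None:
        # Rewrite the incoming user message at inference time.
        message.set(TEXT, to_phonetic(message.get(TEXT)))
```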
I honestly appreciate your input on this,
Konstantinos
P.S. By the way, we really enjoyed your presentation at the Rasa Summit, thank you for that!
It’s not out just yet, but as of Rasa 2.5 we will roll out support for spaCy 3.0. I don’t know when support was added for Greek, but I do know that it’s there now. You should already be able to download spaCy in a notebook and play around; there’s more info on the model here. Have you explored if spaCy is of help? You can configure Rasa to use the lemmas from spaCy as tokens as well.
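To play around in a notebook, something like the snippet below should do (I’m assuming the el_core_news_md pipeline here; check spaCy’s model page for the Greek pipelines that are actually available):

```python
import spacy

# Assumes the Greek pipeline has been installed first:
#   python -m spacy download el_core_news_md
nlp = spacy.load("el_core_news_md")

doc = nlp("εισερχόμενες κλήσεις")  # "incoming calls"
print([(token.text, token.lemma_) for token in doc])
```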
“We already have the code to create a phonetic representation of a string in Greek. If we proceeded with solution 1 above, where would be the best place to apply it?”
If you’ve got an open repo for it, I’d love to have a peek. What you’re suggesting in solution 1 sounds reasonable to me: you want to end up with some sort of sparse vector, and I think _process_message should do the trick. I am very curious to be kept in the loop though, since many languages have a similar problem these days. Hindi is more like “Hinglish” when kids are typing on keyboards.
+1. Almost all Indian languages have this thing where phonetics are used to type on QWERTY keyboards, and it has been that way since the internet became a thing in India. The text is not just mixed; it is typed out entirely from the sound a word makes, and the majority of the data you would find online to scrape for such languages is in this form, so there is a vast amount of internet data available. I tried the supervised embeddings once, a long time ago, but it quickly became clear that they didn’t work for everything.
I have noticed Facebook translating phonetically typed Hindi/Bengali into English a few times, but I can’t find any paper or code from them on how they are doing it.
@souvikg10 I wonder… is there a dataset known to you that has a large corpus of Hinglish texts? I might be able to do a pre-training trick on the tokens that’s worth a try: pre-train a subword tokeniser that can directly feed a countvectorizer. I’ve open-sourced one such implementation for scikit-learn here, but the pretrained variant that I currently have is trained on Wikipedia.
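The training step itself is cheap. Roughly like the sketch below, if sentencepiece is used as the subword learner (corpus.txt is a hypothetical file with one Hinglish utterance per line, and the vocabulary size is just an example value):

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import CountVectorizer

# Learn a byte-pair vocabulary from a raw Hinglish corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="hinglish_bpe",
    vocab_size=8000, model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="hinglish_bpe.model")

# Feed the learned subwords straight into a CountVectorizer.
vectorizer = CountVectorizer(analyzer=lambda text: sp.encode(text, out_type=str))
X = vectorizer.fit_transform(["mera naam kya hai", "tum kaun ho"])
```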
I saw something recently called the HOT dataset (Highly Offensive Tweets), where someone used the Twitter API to scrape tweets in Hinglish (keeping tweets with at least three Hindi words). I don’t know where that dataset went, but the closest thing I found when searching on GitHub is this.
If you want a great dataset, I think this trick would work: start listening to trendy Twitter topics or public Facebook pages with particular hashtags. The IPL (Indian Premier League) starts tomorrow in India, so Twitter is going to be pretty busy; I am sure you will be able to collect a lot of Hinglish tweets that way. The other option is the Facebook pages of popular celebrities in India: pull the comments from these public pages, which you should be able to do.
@kmegalokonomos I’ve been thinking about your Greek use-case, and part of me is now wondering if it’s perhaps easier to solve the problem by paraphrasing. We could take the Greek text in the Greek alphabet and turn it into Latin script. That means that, technically, we could make two nlu.yml files: one for Latin and one for Greek. We could then have Rasa train on both. Does this approach make sense?
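Even a naive character map might get us surprisingly far for generating the Latin file. A minimal sketch (a real mapping would also need digraphs like “ου” and accent stripping):

```python
# Naive Greek -> Latin transliteration; enough to show the idea.
GREEK_TO_LATIN = str.maketrans({
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "x", "ψ": "ps",
    "ω": "o",
})

def to_latin(text: str) -> str:
    return text.lower().translate(GREEK_TO_LATIN)

print(to_latin("εισερχομενες"))  # -> "eiserxomenes"
```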
“Have you explored if spaCy is of help? You can configure Rasa to use the lemmas from spaCy as tokens as well.”
Yes, I have tried spaCy. Our case is very domain-specific, and pretrained embeddings didn’t work well. I then tried using only the tokens/lemmas from spaCy and feeding them into the CountVectorsFeaturizer, but that didn’t work well either, because the lemmatization accuracy for Greek in spaCy is about 56% (very low).
One more reason that typos are hard to handle in Greek is that there are many ways to write the same vowel. For example, the “i” sound in Greek can be written as “ι”, “υ”, “η”, or “ει”, and people very commonly mix them up.
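One naive trick we have considered for this is collapsing all the “i” spellings to a single canonical letter before featurizing; just a sketch of the idea, not something we run in production:

```python
import re

def normalize_i_sounds(text: str) -> str:
    # Collapse every spelling of the "i" sound into one letter so that
    # "ει", "η", "υ" and "ι" no longer produce different n-grams.
    text = text.replace("ει", "ι")
    return re.sub("[υη]", "ι", text)

print(normalize_i_sounds("εισερχομενες"))  # -> "ισερχομενες"
```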
“I’ve been thinking about your Greek use-case, and part of me is now wondering if it’s perhaps easier to solve the problem by paraphrasing. We could take the Greek text in the Greek alphabet and turn it into Latin script. That means that, technically, we could make two nlu.yml files: one for Latin and one for Greek. We could then have Rasa train on both.”
This is actually similar to what I have done now: I have transformed the examples into a Latin representation.
“Is your dataset publicly available? I’ve actually got a small set of tools that I’d love to try out.”
Unfortunately, we don’t have anything publicly available yet.
Quick question: at some point I used the byte-pair embeddings you had suggested in the workshop, and they worked better than spaCy. However, they wouldn’t work with the message transformed into Latin characters, correct? Is there a way to train the byte-pair embeddings on a new dataset?
@kmegalokonomos Funny that you mention the BytePair embeddings. I’m working on a feature over at the nlu-examples repository that will allow you to pass not the embeddings, but the subtokens, to be sparsely encoded.
Technically, you could translate the Latin characters back into Greek and fetch the resulting embedding. This feels a bit experimental, but I suppose you could try it and see if it works.
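Assuming the BytePair embeddings in question are the BPEmb ones, fetching subwords and vectors for the back-translated Greek text would look roughly like this (vocabulary size and dimension are just example values):

```python
from bpemb import BPEmb

# Pretrained Greek byte-pair embeddings (trained on Wikipedia).
bpemb_el = BPEmb(lang="el", vs=10000, dim=100)

# After translating the Latin text back into Greek characters:
text = "εισερχομενες"
print(bpemb_el.encode(text))       # the byte-pair subword tokens
print(bpemb_el.embed(text).shape)  # one embedding vector per subword
```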
Having said that, I wouldn’t worry about training your own embeddings too much. Assuming you’re using DIET, you’d automatically be training an internal representation for all of your (sub)tokens already. There’s an elaborate thread on this topic here.
This quarter I might start working on a command-line tool with paraphrasing tricks. I think “paraphrasa” might be an awesome name for that project. Originally I wanted to include mainly spelling-related tricks, but I’ll likely also add a Greek-Latin transliterator as well as a demo. Should the time come around, could I poke you, @kmegalokonomos, for a review?
Also, could you share anything about how effective the Latin translation trick is? Are you training only on the Latin characters, or on both the Latin and Greek ones?
I’ve started on a Rasa augmentation project here. I’m mainly working on keyboard typos for now, but I certainly wouldn’t mind trying out the Greek letter substitution trick.
Hey Vincent! Nice to hear from you, and great that this talk is actually leading to an episode. Last month was hectic and I completely forgot to get back to you with all the work. I am away for the week, but do you want to talk about the substitutions a bit next week? They need some work, and I can surely help! Regarding your examples, if someone wrote them with Greek characters and wanted them to sound the same as in English, it would be:
αρ γιου ε μποτ?
αρ γιου ε χιουμαν?
αμ αΐ τοκινγκ του ε μποτ?
αμ αϊ τοκινγκ του ε χιουμαν?
You can mail me at cosmeg @ gmail . com and we can figure out the best platform for a quick chat.