(hey Sams, yes, I am currently experimenting with custom components)
Hey @koaning, first of all let me quickly say that I am working with the Greek language. Regarding CountVectorsFeaturizer, we're having some problems with the n-grams. Mainly we get misclassifications because one word is a substring of another, or two words share the same root. Take “εισερχομενες” and “εξερχομενες”, which mean incoming and outgoing. This happens a lot in Greek. I guess an English example could be “classif-ication” and “publ-ication”. Also, believe it or not, there is a “new” language in Greek called Greeklish: people write Greek with English characters because they can't be bothered to switch keyboards (mostly young people). So one person would write “εισερχομενες” and another would write “eiserxomenes”. It's as if people are doing some sort of phonetic processing in their heads while writing. This would not work with simple n-grams. I have thought of two ways to tackle this:
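To make the Greeklish point concrete, here is a toy normalizer (my own sketch, not anything from Rasa, and the character table is deliberately simplified — a real one would also handle digraphs like “ου”/“μπ” and accented vowels) that collapses the Greek and Greeklish spellings to the same string:

```python
# Toy Greek -> Latin phonetic normalizer (illustrative sketch only).
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "x", "ψ": "ps",
    "ω": "o",
}

def normalize(text: str) -> str:
    """Lowercase and transliterate Greek letters; Latin letters pass through."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text.lower())

# Both spellings of "incoming" collapse to the same form:
print(normalize("εισερχομενες"))   # -> eiserxomenes
print(normalize("eiserxomenes"))   # -> eiserxomenes
```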
One would be to create a custom preprocessor (I already have one), convert the message text to a phonetics-based representation, and then call message.set(text) so the changed message is what the next components in the pipeline see. This would also handle the intent examples during training. Alternatively, we could do similar processing on the tokens in the tokenizer, or do it inside CountVectorsFeaturizer as part of the _process_message method. In the end we would still use count vectors, but based on the processed message.
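The shape of option 1 is roughly the following. In real code this would live in a custom NLU component (e.g. subclassing `rasa.nlu.components.Component` in Rasa 2.x) and rewrite the text attribute in both `train()` and `process()`; here a minimal stand-in `Message` class mimics the `message.get`/`message.set` interface so the sketch is runnable, and `to_phonetics` is a placeholder for our actual Greek phonetics code:

```python
# Sketch of option 1: rewrite the message text before featurization.
TEXT = "text"

class Message:
    """Minimal stand-in for Rasa's Message, for illustration only."""
    def __init__(self, text: str):
        self.data = {TEXT: text}
    def get(self, prop):
        return self.data.get(prop)
    def set(self, prop, value):
        self.data[prop] = value

def to_phonetics(text: str) -> str:
    """Placeholder for our Greek phonetics code; just lowercases here."""
    return text.lower()

def process(message: Message) -> None:
    # Overwrite the message text so every downstream component
    # (tokenizer, CountVectorsFeaturizer, ...) sees the phonetic form.
    message.set(TEXT, to_phonetics(message.get(TEXT)))

msg = Message("Eiserxomenes Kliseis")
process(msg)
print(msg.get(TEXT))  # -> eiserxomenes kliseis
```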
The second would be to create some sort of PhoneticsFeaturizer that creates features (in addition to the count vectors) that capture how close a user utterance sounds to the training examples. (This is still not 100% clear in my head; I am still thinking it through.)
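One very rough way such a featurizer could score closeness, sketched with the standard library's difflib rather than any real Rasa featurizer API (the function name and the idea of emitting the score as a dense feature are my assumptions, not settled design):

```python
from difflib import SequenceMatcher

def phonetic_similarity(a: str, b: str) -> float:
    """Crude similarity ratio between two phonetic strings, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# A PhoneticsFeaturizer might emit, say, the maximum similarity between
# the phoneticized user utterance and each intent's training examples
# as an extra dense feature alongside the count vectors.
score = phonetic_similarity("eiserxomenes", "eiserhomenes")
print(score)  # high similarity: the strings differ in a single letter
```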
We already have the code to create a phonetic representation of a Greek string. If we proceeded with solution 1 above, where would be the best place to apply it?
I'd honestly appreciate your input on this,
P.S. Btw, we really enjoyed your presentation at the Rasa Summit, thank you for that!