How can we handle spelling mistakes in general for better intent classification? I am using the intent classifier in the TensorFlow pipeline, and it does not generalize well to inputs with spelling mistakes, even ones close to the examples in the training data. Can anyone suggest a way to handle this? Thanks!
Another way might be to write a script that takes your nlu file to create additional examples based on your original ones, but including spelling errors.
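A rough sketch of what such a script could look like, assuming the examples have already been read out of the NLU file as plain strings. The function names (make_typo, augment_examples) and the three error types are only illustrative, and the sketch does not parse the NLU format itself, so entity annotations would need to be protected before running something like this on them.

```python
import random


def make_typo(text, rng):
    """Return `text` with one random character-level error applied."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    kind = rng.choice(["swap", "drop", "double"])
    if kind == "swap":
        # Transpose two neighbouring characters.
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if kind == "drop":
        # Delete one character.
        return text[:i] + text[i + 1:]
    # Repeat one character.
    return text[:i] + text[i] + text[i:]


def augment_examples(examples, per_example=2, seed=0):
    """Keep every original example and add up to `per_example` noisy variants."""
    rng = random.Random(seed)  # seeded so the generated data is reproducible
    augmented = []
    for example in examples:
        augmented.append(example)
        variants = set()
        # Bounded number of attempts, since a typo can leave the text unchanged
        # (for instance swapping the two o's in "book").
        for _ in range(per_example * 5):
            if len(variants) >= per_example:
                break
            candidate = make_typo(example, rng)
            if candidate != example:
                variants.add(candidate)
        augmented.extend(sorted(variants))
    return augmented


if __name__ == "__main__":
    for line in augment_examples(["what is my balance", "book a flight"]):
        print(line)
```

Keeping the originals alongside the noisy variants means the clean phrasing is still represented; how many variants per example actually help is something to check by experiment.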
Having an NLU trained on all possible typos is going to be annoying, if not unsustainable. Are you really going to think of all the misspellings, cultural memes, and dialect variations just to create the NLU training data? I couldn't, so I had to use a correction service.
For my implementation I just use Slack, so it was simple to put a Levenshtein-distance-based spell checker directly into the Python Slack channel message handler, but I think a more generalised and sustainable way is to use a custom component, as @huberrom mentioned.
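For anyone curious what a Levenshtein-based checker can look like, here is a minimal sketch. It is not the code from my handler: the function names, the vocabulary set and the max_distance threshold are assumptions for illustration, and the vocabulary would have to come from your own training data or a word list.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def correct_word(word, vocabulary, max_distance=1):
    """Replace `word` with the closest vocabulary entry, if it is close enough."""
    if not vocabulary or word in vocabulary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (levenshtein(word, v), v))
    return best if levenshtein(word, best) <= max_distance else word


def correct_text(text, vocabulary, max_distance=1):
    """Lowercase, split on whitespace and correct each token independently."""
    return " ".join(
        correct_word(token, vocabulary, max_distance)
        for token in text.lower().split()
    )


if __name__ == "__main__":
    vocab = {"check", "my", "balance", "transfer"}
    print(correct_text("chek my balanse", vocab))
```

The threshold is kept at 1 here on purpose: with a larger value, short words get pulled towards unrelated vocabulary entries very easily.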
Hi, I have tried following the steps in "Custom component for Spell checking", but it's throwing an error "message.text" with Rasa 2.0. You can get the text using message['text']. But I still couldn't find a way to set the text to new_message (message["text"] = new_message doesn't work).
Can anyone please help with this?
I'm currently doing a bit of research on this topic. My belief is that it's dangerous to use spell checkers because they will sometimes get it wrong too, especially when you apply them to short sentences.
Instead, I'm trying to figure out if it makes sense to augment the training data beforehand so that it contains spelling errors. I've done some experiments with nlpaug that look promising. It's essentially what @mauricedoepke suggests, but what I'm trying to figure out is whether it makes sense to add more training data or whether you can instead apply a --finetune trick. Note that all of this is work in progress and should not be interpreted as something that will "always work", but it's worth experimenting.
Hi @joancipria,
I'm using Rasa 2.3.4 and message.data["text"] works for me. Try inspecting what you have inside the "data" dict to check which property holds the message.
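A small sketch of that check. StandInMessage is a hypothetical stand-in that only models the "data" dict, so the snippet runs without Rasa installed; it is not Rasa's own class, and rewrite_text and the corrector argument are illustrative names. Whether later pipeline components pick up the rewritten text is also an assumption that depends on where the component sits in the pipeline.

```python
class StandInMessage:
    """Hypothetical stand-in for a message object; only `data` is modelled."""

    def __init__(self, text):
        self.data = {"text": text}


def rewrite_text(message, corrector):
    """Read the text from message.data, correct it and write it back.

    Returns the list of keys found in message.data so you can see
    which property actually holds the user input.
    """
    keys = sorted(message.data.keys())
    original = message.data.get("text")
    if original is not None:
        message.data["text"] = corrector(original)
    return keys


if __name__ == "__main__":
    msg = StandInMessage("helo there")
    print(rewrite_text(msg, lambda t: t.replace("helo", "hello")))
    print(msg.data["text"])
```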