Handling Spelling Mistakes in NLU

Hi there,

How can we handle spelling mistakes in general for better intent classification? I am using the intent classifier in the TensorFlow pipeline, and it does not generalize well to inputs with spelling mistakes, even ones close to the examples in the training data. Can anyone suggest a way to handle this? Thanks!


You can add a spell checker to your pipeline by creating a custom component (see Custom Components) that corrects the mistakes before classification.


Another way might be to write a script that takes your nlu file to create additional examples based on your original ones, but including spelling errors.
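A minimal sketch of such a script, assuming the examples have already been loaded as plain strings. The function names `add_typo` and `augment` are made up for illustration, and the sketch ignores entity annotations, so annotated examples would need extra care.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Return `text` with one random character-level typo applied."""
    if len(text) < 2:
        return text
    op = rng.choice(["swap", "drop", "double"])
    i = rng.randrange(len(text) - 1)
    if op == "swap":
        # Transpose two neighbouring characters.
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if op == "drop":
        # Delete one character.
        return text[:i] + text[i + 1:]
    # Duplicate one character.
    return text[:i] + text[i] + text[i:]


def augment(examples, copies=2, seed=0):
    """Return each original example followed by `copies` misspelled variants."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    out = []
    for example in examples:
        out.append(example)
        for _ in range(copies):
            out.append(add_typo(example, rng))
    return out


if __name__ == "__main__":
    for line in augment(["hello there", "book a flight"]):
        print(line)
```

The variants would then be written back into the NLU file under the same intent as the original example.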


Having an NLU trained on all possible typos is going to be annoying, if not unsustainable. Are you really going to think of all the misspellings, cultural memes, and dialect variations to even create the NLU training data? I couldn’t, so I had to use a correction service. For my implementation I just use Slack, so it was simple to put a Levenshtein-distance-based spell checker directly into the Python Slack channel message handler, but I think a more generalised and sustainable way is to use a custom component, as @huberrom mentioned.
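For illustration, a Levenshtein-distance-based corrector along those lines could look roughly like this. It is only a sketch, not the poster's actual implementation: the vocabulary, the function names, and the `max_distance` threshold are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between `a` and `b` (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct_word(word, vocabulary, max_distance=2):
    """Replace `word` with the closest vocabulary entry, if it is close enough."""
    if not vocabulary or word in vocabulary:
        return word
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else word


def correct_text(text, vocabulary, max_distance=2):
    """Lowercase `text`, split on whitespace, and correct each word."""
    words = text.lower().split()
    return " ".join(correct_word(w, vocabulary, max_distance) for w in words)


# Hypothetical vocabulary; in practice it could be built from the training examples.
VOCAB = ["i", "want", "to", "book", "a", "flight"]

if __name__ == "__main__":
    print(correct_text("I wnat to bok a flihgt", VOCAB))
```

With a small vocabulary like this, short words are easily over-corrected, so the threshold is worth tuning against real user messages.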


You can write a custom component for this task instead of adding spelling mistakes to the training data.

You can refer to this article:

Custom component for Spell checking

Hi, I followed the steps in Custom component for Spell checking, but it throws an error on “message.text” with Rasa 2.0. You can get the text using message['text']. But I still couldn’t find a way to set the text to new_message (message["text"] = new_message doesn’t work). Can anyone please help with this?

Hi Pradeep,

This worked for me for overwriting the message: message.set('text', new_message, add_to_output=True)

Hi! You must use message.data['text'] instead of message['text'] in Rasa 2.0

I’m also following the same medium tutorial as @pradeepbatchu but I can’t get the text message. I’ve tried in the following ways:

using message.get("text") I get 'NoneType' object has no attribute 'split'.

using message.data["text"] I get KeyError: 'text'

using textdata = message["text"] I get TypeError: 'Message' object is not subscriptable

I’ve used them in the following code:

    # textdata = message.text (old way)
    textdata = message.get("text")
    # textdata = message.data["text"] (@joseferrerglobant way)
    textdata = textdata.split()
    new_message = ' '.join(spell.correction(w) for w in textdata)
    # message.text = new_message (old way)
    message.set('text', new_message, add_to_output=True)

How can I correctly get the message?

I’m currently doing a bit of research on this topic. My belief is that it’s dangerous to use spell checkers because they will sometimes get it wrong too, especially when you apply them to short sentences.

Instead, I’m trying to figure out if it makes sense to augment the training data beforehand so that it contains spelling errors. I’ve done some experiments with nlpaug that look promising. It’s essentially what @mauricedoepke suggests, but what I’m trying to figure out is whether it makes sense to add more training data or whether you can instead apply a --finetune trick. Note that all of this is work in progress and should not be interpreted as something that will “always work”, but it’s worth experimenting.

Hi @joancipria, I’m using Rasa 2.3.4 and message.data["text"] works for me. You should look at what you have inside the “data” dict to check which property holds the message.

You can check the code on Message class:

rasa/message.py at 2.4.x · RasaHQ/rasa (github.com)
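
One possible reading of the errors reported above (None from message.get("text"), KeyError from message.data["text"]) is that some messages reaching the component carry no "text" entry at all. That is an assumption, not something confirmed in this thread. Under that assumption, a defensive version of the correction step could look like the sketch below. FakeMessage is a stand-in written only so the example runs without Rasa installed; it mimics the data, get and set members used in the posts above. The helper name correct_message is hypothetical.

```python
class FakeMessage:
    """Stand-in for Rasa's Message class, only so this sketch runs without Rasa."""

    def __init__(self, data):
        self.data = dict(data)

    def get(self, prop, default=None):
        return self.data.get(prop, default)

    def set(self, prop, info, add_to_output=False):
        self.data[prop] = info


def correct_message(message, correct):
    """Apply `correct` to the message text, skipping messages without text.

    Returns True if the text was rewritten, False if there was nothing to do.
    """
    text = message.get("text")
    if not text:
        # Assumption: such messages exist; list(message.data.keys()) shows
        # which properties this message actually carries.
        return False
    message.set("text", correct(text), add_to_output=True)
    return True


if __name__ == "__main__":
    msg = FakeMessage({"text": "helo wrld"})
    correct_message(msg, lambda t: t.replace("helo", "hello").replace("wrld", "world"))
    print(msg.get("text"))
    print(correct_message(FakeMessage({"intent": "greet"}), str.upper))
```

Inside a real component, the same guard would go around the split and join calls from the snippet above, with the spell checker passed in as `correct`.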