Can anyone explain what normalization is in Rasa?
Do you mean normalization of confidence scores? The softmax confidence scores depend on the number of prediction classes. Since you can have an ever-growing list of intents or actions to predict, the confidences of predicted actions decrease, not because the algorithm becomes less sure that this is the best class, but because the number of possible classes has increased. So we renormalize the confidence over the top 10 predictions, making them more or less independent of the total number of possible classes.
I want to know more about normalization and data cleaning as they are described in theory… and if I am developing a chatbot with Rasa, how are these things handled in Rasa?
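To make the effect concrete, here is a minimal sketch in plain Python. It is not Rasa's actual implementation; the function names `softmax` and `renormalize_top_k` and the example scores are made up for illustration. It shows that the top softmax confidence drops when more low-scoring classes are added, and that rescaling over the top 10 brings it back.

```python
import math


def softmax(logits):
    """Turn raw scores into confidences that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize_top_k(confidences, k=10):
    """Keep the k largest confidences and rescale them so they sum to 1.

    Hypothetical helper for illustration, returns {class_index: confidence}.
    """
    ranked = sorted(enumerate(confidences), key=lambda p: p[1], reverse=True)[:k]
    total = sum(conf for _, conf in ranked)
    return {idx: conf / total for idx, conf in ranked}


# Same two "strong" classes, but a growing number of weak ones.
small = softmax([3.0, 1.0] + [0.0] * 8)   # 10 classes in total
large = softmax([3.0, 1.0] + [0.0] * 98)  # 100 classes in total

print("top confidence, 10 classes: ", round(small[0], 3))
print("top confidence, 100 classes:", round(large[0], 3))
print("after top-10 renormalization:", round(renormalize_top_k(large)[0], 3))
```

The model is equally "sure" about class 0 in both cases; only the number of competing classes changed, which is what the renormalization compensates for.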
no, the first screenshot is about statistical distributions and the second one represents a confusion matrix between different predictions
so can the second one be called normalization?
no, it represents how often one intent is confused with another
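As a rough sketch of what such a confusion matrix counts, assuming some made-up intent names (`greet`, `goodbye`, `ask_weather`) and made-up predictions, you could tally (true intent, predicted intent) pairs like this:

```python
from collections import Counter


def confusion_counts(true_intents, predicted_intents):
    """Count how often each true intent was predicted as each intent."""
    return Counter(zip(true_intents, predicted_intents))


# Hypothetical evaluation data, purely for illustration.
true_intents = ["greet", "greet", "goodbye", "goodbye", "ask_weather"]
predicted_intents = ["greet", "goodbye", "goodbye", "goodbye", "ask_weather"]

counts = confusion_counts(true_intents, predicted_intents)
for (true_label, predicted_label), n in sorted(counts.items()):
    print(f"true={true_label:12s} predicted={predicted_label:12s} count={n}")
```

The off-diagonal entries (where the true and predicted intent differ) are the confusions. Nothing is being rescaled here, which is why this is not normalization.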
How about this… I am trying to understand the word normalization in the context of chatbots, machine learning and NLP.
“normalization” is a bit of a vague term but I’ll try to highlight some of the important bits here.
Typically, machine learning algorithms need numeric data. Text isn’t numeric so that means that we have to translate our text. Usually, text is first turned into tokens and these tokens are then turned into “something numeric”.
Text normalisation typically occurs in the step where you go from text to tokens. This step is often called “tokenisation” and it isn’t as trivial as you might think.
Take the sentence “it is nice weather, isn’t it?”. There’s a few ways you could split it up.
["It", "is", "nice", "weather", ",", "isnt", "it", "?"]
["It", "is", "nice", "weather", ",", "is", "n't", "it", "?"]
["it", "is", "nice", "weather", "is", "n't", "it"]
In the first example we split it up by splitting on the whitespace character " ". In the second example we’re trying to be a bit more clever because we also split up "n't". In the final example we go even a step further by removing all the punctuation and capitalisation.
Which one of these approaches is best? Well … that’s kind of hard to say. It really depends on your use-case. This is why in Rasa we allow you to customise this.
- You can use a WhitespaceTokenizer in your config.yml file to split up the tokens using a " " and you can also configure it to make all the characters lowercase and ignore punctuation/emoji.
- You can use a SpacyTokenizer in your config.yml file to split up the tokens using spaCy. This tool has some clever heuristics built in to split up the characters such that "isn't" is split into "is" and "n't", as in the second example above.
These are some examples of what you could call “normalisation”. I hope it’s clear that a machine learning pipeline will demonstrate different behavior if you pick a different normalisation/tokenisation strategy.
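The three ways of splitting the example sentence can be sketched in plain Python. This is only an illustration: it is neither Rasa's WhitespaceTokenizer nor spaCy, and the helper names below are invented for this example.

```python
import re


def split_words_and_punct(text):
    """Split into words (keeping apostrophes) and single punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text)


def split_contractions(tokens):
    """Additionally split a trailing "n't" off tokens such as "isn't"."""
    result = []
    for token in tokens:
        if token.endswith("n't") and len(token) > 3:
            result.extend([token[:-3], "n't"])
        else:
            result.append(token)
    return result


def lowercase_without_punct(tokens):
    """Lowercase everything and drop tokens that are pure punctuation."""
    return [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]


sentence = "It is nice weather, isn't it?"

basic = split_words_and_punct(sentence)
clever = split_contractions(basic)
normalised = lowercase_without_punct(clever)

print(basic)
print(clever)
print(normalised)
```

Each step produces a different list of tokens from the same sentence, so whatever turns tokens into numbers afterwards will see different input depending on the strategy you pick.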
There’s some other details as well … but before going in depth there it might be good to stand still and wonder “is this super important”? Typically, the main issue when you’re designing a chatbot isn’t hyperparameters like “are we normalising text the right way”. Sure, there might be an optimisation possible, but it’s not the main issue.
The main issue is usually data quality and figuring out how your users want to use your assistant. The main reason why I want to mention this is that it can be very easy to get lost in “all the things we could optimise”. It’s better to worry about getting a representative dataset first. Without it, optimisation quickly becomes a silly exercise.
I’m actually working on content for tokenisation/preprocessing for the algorithm whiteboard playlist. If there’s any core questions you have; feel free to mention them here. I’ll keep them in mind as I make more videos.
Thank you so much koaning, that is a nice explanation. Now I understand a bit about what normalization is, and I can write about what you told me in my research paper… I am actually a student and I am developing a chatbot with Rasa… I find Rasa a very interesting machine learning tool for building assistants like chatbots… While I was writing a research paper for my project I got stuck in the middle when I reached “data cleaning” and “normalization”… What am I supposed to write about these things, and how are they defined in Rasa? I was looking for someone who could give a short description of this and I found one, thank you so much… I would also appreciate it if you could explain “data cleaning” like you did for normalization. Thank you so much.
And just to make it clear: if you describe those Rasa pipeline components as being used for normalization, then not only those two but other pipeline components also serve a data normalization purpose? Am I correct?
This question isn’t very clear to me, could you rephrase?
The whole topic of “data quality” usually boils down to this drawing (assuming we’re talking about chatbots):
You want to be sure that whatever algorithm you optimise … that you actually score it on data that is relevant to users. If your users write lots of spelling errors, then maybe you do not need to clean them up. After all, your algorithm needs to be able to handle them.
In a similar fashion; if you have a subset of intents that you can predict very accurately but none of your users ever utter these intents … then maybe this is a good reason to remove them from your training data.
Data quality is not so much a “general goal of best practices” rather it is more about “understanding your specific problem and making sure that your data resembles it effectively”. Text normalisation techniques can be a part of it, but this will always depend on your use-case.
You gave some examples of normalization and of how Rasa allows us to customise them by using pipeline components in the config.yml file… do you get me now? Then my question is: what about the pipeline components other than the WhitespaceTokenizer and the SpacyTokenizer? For example, for synonyms there is the EntitySynonymMapper, so can that also be considered data normalization?
Sure, that’d fall under the data quality banner.
so which of the algorithms are responsible for cleaning those kind of datas or be able to handle those errors in rasa?
It’s hard for me to understand what you’re referring to when you type “those kind of datas”. What text do you mean in particular? You can quote text here by using the > markdown syntax.
When you say it depends on my use-case: assume I am developing a medical chatbot, how would it be defined in that case?
This is something that I cannot judge from here. Medical chatbots cover a large array of possible applications, so it is better that you think for yourself about what the users of this chatbot would need.
Sorry, when I say “those kind of datas” I mean the part where you say “If your users write lots of spelling errors, then maybe you do not need to clean them up. After all, your algorithm needs to be able to handle them.” and I asked what algorithm can be used to handle those kinds of errors.
Part of the solution lies not in the algorithm, but the preprocessing instead. You might enjoy this video on countvectorizers.
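One preprocessing idea that is often paired with count vectors is to count character n-grams instead of whole words. I am assuming here that this is the kind of preprocessing meant; the sketch below is plain Python with made-up helper names, not Rasa's featurizer. It shows that a misspelled word still shares most of its character trigrams with the correct spelling, even though the two do not match as whole tokens.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with spaces."""
    padded = f" {word} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


correct = char_ngrams("weather")
typo = char_ngrams("wether")      # a spelling error
unrelated = char_ngrams("booking")

print("same token?", "weather" == "wether")
print("overlap with typo:     ", round(jaccard(correct, typo), 2))
print("overlap with unrelated:", round(jaccard(correct, unrelated), 2))
```

With word-level features the typo looks like a completely unknown word, while with character n-grams it still looks a lot like the original, so the model downstream has something to work with.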
okay i will check it out
Also, is the “def submit” function not going to work in the new Rasa 2.0?
You’ll need to be clearer there. What do you mean with