Does punctuation affect the model?

For example, does adding or removing question marks in the training examples for intents affect the prediction model? Similarly, does a user’s utterance containing different punctuation, such as a question mark, affect the prediction model? I assumed not at first, but I’ve seen some random instances where punctuation does seem to have an effect. If so, we will probably want to add something to our pipeline that strips punctuation out to avoid random bias based on punctuation…
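
Something like this simple normalization step is what I have in mind (just a sketch in plain Python, not an actual pipeline component; the function name is made up):

```python
# Sketch of the kind of preprocessing I mean: strip punctuation from training
# examples and incoming utterances before they reach the model.
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(utterance: str) -> str:
    """Drop ASCII punctuation and collapse extra whitespace.

    Note: string.punctuation includes the apostrophe, so "There's" becomes "Theres".
    """
    return " ".join(utterance.translate(_PUNCT_TABLE).split())

print(strip_punctuation("Is there a bear?!"))  # -> "Is there a bear"
```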

Hi Tatiana! It depends on what pipeline you’re using. If you’re using a tokenizer that treats punctuation as tokens, like the SpaCy tokenizer, yes.
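
For example, with a bare spaCy English tokenizer (assuming spaCy is installed; no model download needed), punctuation comes out as its own tokens:

```python
# Punctuation becomes its own token with spaCy's English tokenizer.
# spacy.blank("en") gives a tokenizer-only pipeline, so no model download is needed.
import spacy

nlp = spacy.blank("en")
print([token.text for token in nlp("There's a bear!")])
# -> ['There', "'s", 'a', 'bear', '!']
```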

If you’re curious about the effect of punctuation on your model, I’d suggest an A/B test before removing it entirely. There is some information present in punctuation, and removing it may end up changing the performance of your models. (Consider the difference between “There’s a bear!” and “There’s a bear?”, for example. You’d probably want those two utterances classified into different intents.)
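
Here’s a rough sketch of the A/B idea (not Rasa-specific, and the tiny dataset below is made up purely for illustration): train the same simple classifier twice, once with punctuation kept as features and once with it dropped, and compare the scores.

```python
# A/B comparison sketch: same classifier, with vs. without punctuation features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "There's a bear!", "Watch out, run!", "Help, it's right behind you!",
    "There's a bear?", "Is that a bear?", "Did you see something over there?",
]
intents = ["warn", "warn", "warn", "ask", "ask", "ask"]

with_punct = make_pipeline(
    TfidfVectorizer(token_pattern=r"[\w']+|[!?]"),  # '!' and '?' become features
    LogisticRegression(),
)
without_punct = make_pipeline(
    TfidfVectorizer(token_pattern=r"[\w']+"),       # punctuation is ignored
    LogisticRegression(),
)

print("with punctuation:   ", cross_val_score(with_punct, texts, intents, cv=3).mean())
print("without punctuation:", cross_val_score(without_punct, texts, intents, cv=3).mean())
```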

Hope that helps! :slight_smile:

I think the addition and omission of punctuation really does affect this. By the way, I recently decided to start learning English grammar again. Unfortunately, I had not studied English grammar for a long time and had forgotten a lot of things. Now I have started studying point of view worksheet 1, which helps me remember how to express my emotions and feelings grammatically in text. Worksheets like these really help you recall many grammatical points you might have forgotten.

Hello,

I have the same problem. It is true that the SpaCy tokenizer treats punctuation as tokens, but as far as I can tell from the pkl file of the CountVectorsFeaturizer, punctuation is not considered.
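
Here is a small reproduction of what I mean, assuming the CountVectorsFeaturizer is backed by sklearn’s CountVectorizer:

```python
# The featurizer side of the discrepancy: sklearn's CountVectorizer, with its
# default token_pattern, only keeps alphanumeric tokens of 2+ characters,
# so punctuation never reaches the saved vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # default token_pattern=r"(?u)\b\w\w+\b"
vectorizer.fit(["There's a bear!", "There's a bear?"])
print(vectorizer.get_feature_names_out())  # -> ['bear' 'there'], no '!' or '?'
```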

Any explanation, please?