How does NLU deal with apostrophes?

I’m doing some data augmentation, and the tool I currently use (jasonwei20/eda_nlp on GitHub, the code for the ICLR 2019 Workshop paper “Easy data augmentation techniques for boosting performance on text classification tasks”) removes all apostrophes. This means it turns I’m into either Im or I m. I also just found out that not all apostrophe characters are equal, so I’m looking forward to cleaning my data.

Anyway, how does Rasa handle this? Should I work to get those apostrophes back into my training data or not?

Edit: Now I’m also wondering about other things like brackets: “Hi, I’m Jane (Gary’s Wife)”. If you ignored the brackets you’d lose meaning, wouldn’t you?

Hey @AllBecomesGood! It depends on the tokenizer you use. If you are using the WhitespaceTokenizer, it won’t do anything special with apostrophes. If you’re using e.g. the spaCy tokenizer, the behaviour will depend on the language.
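To make the difference concrete, here is a quick standalone comparison (outside of Rasa) of plain whitespace splitting versus spaCy’s English tokenizer. It’s just a sketch: it assumes spaCy and the en_core_web_sm model are installed, and plain str.split() is only a stand-in for the WhitespaceTokenizer, which may do some additional cleanup of its own.

```python
import spacy

text = "Hi, I'm Jane (Gary's Wife)"

# Plain whitespace splitting keeps apostrophes and brackets glued to the words
print(text.split())
# ['Hi,', "I'm", 'Jane', "(Gary's", 'Wife)']

# spaCy's rule-based English tokenizer splits off punctuation and contractions
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(text)])
# ['Hi', ',', 'I', "'m", 'Jane', '(', 'Gary', "'s", 'Wife', ')']
```

The exact token boundaries can vary between spaCy versions and languages.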

For the supervised embeddings (AKA tensorflow) pipeline, you can customize how words are split using the token_pattern option (see the Components documentation).
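As far as I know the CountVectorsFeaturizer is built on scikit-learn’s CountVectorizer, so you can experiment with token_pattern in plain scikit-learn before putting a value into your pipeline. A minimal sketch (the custom pattern below is just an illustrative guess, not a recommended setting):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I'm doing data augmentation"]

# Default pattern \b\w\w+\b drops single characters and splits on the apostrophe
default_vec = CountVectorizer().fit(texts)
print(sorted(default_vec.vocabulary_))
# ['augmentation', 'data', 'doing']  -- "I'm" disappears entirely

# A pattern that keeps apostrophes inside tokens
custom_vec = CountVectorizer(token_pattern=r"(?u)\b\w[\w']*\b").fit(texts)
print(sorted(custom_vec.vocabulary_))
# ['augmentation', 'data', 'doing', "i'm"]
```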


I am using supervised_embeddings and had to search for the default settings it uses, which (per the “Choosing a Pipeline” docs) are:

pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"

Which means the WhitespaceTokenizer is splitting those occurrences of “I m” in my training data into two separate tokens. I will clean this up first and investigate other tokenizers afterwards.
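For the cleanup itself, this is roughly what I have in mind, as a sketch (a hypothetical helper, not part of Rasa or EDA, and it only handles the one pattern I know is in my data):

```python
import re

def fix_split_contractions(text):
    # Re-join the "I m" artifacts that the augmentation left behind;
    # extend the substitution for other contractions you find in your data.
    return re.sub(r"\bI m\b", "I'm", text)

print(fix_split_contractions("Hi I m Jane"))
# Hi I'm Jane
```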

I am still a bit in the dark about whether the bot will learn to treat “I’m” and “Im” as equal. Unfortunately the augmentation I use (EDA) really messes with the data quite a bit; it even introduces noise such as reversing the meaning of some training examples (I’m not sure if that’s good or bad, though I suppose it should be good to a degree). It has, however, drastically improved my results given the small amount of training data I have, so I’m keeping it for now.

Good find! There’s a small gotcha there: the whitespace tokenizer will be used for entity recognition, but the CountVectorsFeaturizer does its own tokenization (you can choose between word and character n-grams, word boundaries, etc.).
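To illustrate the word vs. character n-gram point, here’s a sketch using plain scikit-learn (assuming, as above, that the featurizer behaves like scikit-learn’s CountVectorizer; the exact option names in the Rasa config may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I'm here", "Im here"]

# Word-level features: "i'm" and "im" end up as two unrelated vocabulary entries
word_vec = CountVectorizer(token_pattern=r"(?u)\b\w[\w']*\b").fit(texts)
print(sorted(word_vec.vocabulary_))
# ['here', "i'm", 'im']

# Character n-grams within word boundaries: the two spellings now share features
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(texts)
shared = set(char_vec.build_analyzer()("i'm")) & set(char_vec.build_analyzer()("im"))
print(shared)
# n-grams such as ' i' and 'm ' appear in both spellings
```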

The featurizer will not treat I’m and Im as equivalent inputs, so you want to make sure you have both in your training data so that your model learns to treat them the same.
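One pragmatic way to do that is to generate the apostrophe-free variant of every training example automatically, so both spellings always end up in the data. This is just a sketch of a hypothetical helper, not part of Rasa or EDA:

```python
APOSTROPHES = ("'", "\u2019")  # straight and curly apostrophes are different characters

def add_apostrophe_variants(examples):
    """For each example that contains an apostrophe, also emit the
    apostrophe-free variant so the model sees both spellings."""
    augmented = list(examples)
    for text in examples:
        stripped = text
        for ch in APOSTROPHES:
            stripped = stripped.replace(ch, "")
        if stripped != text:
            augmented.append(stripped)
    return augmented

print(add_apostrophe_variants(["I'm Jane", "hello there"]))
# ["I'm Jane", 'hello there', 'Im Jane']
```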
