How does NLU deal with apostrophes?

I’m doing some data augmentation, and the tool I currently use (jasonwei20/eda_nlp on GitHub, the code for the ICLR 2019 Workshop paper “Easy data augmentation techniques for boosting performance on text classification tasks”) removes all apostrophes. This means it turns I’m into either Im or I m. I also just found out that not all apostrophe characters are equal, so I’m looking forward to cleaning my data.

Anyway, how does Rasa handle this? Should I work to get those apostrophes back into my training data or not?

Edit: Now I’m also wondering about other things like brackets: “Hi, I’m Jane (Gary’s Wife)”. If you ignored the brackets you’d lose meaning, wouldn’t you?

Hey @AllBecomesGood! It depends on the tokenizer you use. If you are using the WhitespaceTokenizer, it won’t do anything special with apostrophes. If you’re using e.g. the spaCy tokenizer, the behaviour will depend on the language.
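To make the difference concrete, here is a quick standalone comparison (outside of Rasa) of plain whitespace splitting versus spaCy’s English tokenizer. It’s just a sketch: it assumes spaCy and the en_core_web_sm model are installed, and plain str.split() is only a stand-in for the WhitespaceTokenizer, which may do some additional cleanup of its own.

```python
import spacy

text = "Hi, I'm Jane (Gary's Wife)"

# Plain whitespace splitting keeps apostrophes and brackets glued to the words
print(text.split())
# ['Hi,', "I'm", 'Jane', "(Gary's", 'Wife)']

# spaCy's rule-based English tokenizer splits off punctuation and contractions
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(text)])
# ['Hi', ',', 'I', "'m", 'Jane', '(', 'Gary', "'s", 'Wife', ')']
```

The exact token boundaries can vary between spaCy versions and languages.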

For the supervised embeddings (AKA tensorflow) pipeline, you can customize how words are split using the token_pattern option (see the Components documentation).
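As far as I know the CountVectorsFeaturizer is built on scikit-learn’s CountVectorizer, so you can experiment with token_pattern in plain scikit-learn before putting a value into your pipeline. A minimal sketch (the custom pattern below is just an illustrative guess, not a recommended setting):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I'm doing data augmentation"]

# Default pattern \b\w\w+\b drops single characters and splits on the apostrophe
default_vec = CountVectorizer().fit(texts)
print(sorted(default_vec.vocabulary_))
# ['augmentation', 'data', 'doing']  -- "I'm" disappears entirely

# A pattern that keeps apostrophes inside tokens
custom_vec = CountVectorizer(token_pattern=r"(?u)\b\w[\w']*\b").fit(texts)
print(sorted(custom_vec.vocabulary_))
# ['augmentation', 'data', 'doing', "i'm"]
```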


I am using supervised_embeddings and had to search for the default settings it uses, which (per the “Choosing a Pipeline” docs) are:

pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"

Which means the WhitespaceTokenizer is splitting those occurrences of “I m” in my training data into two separate tokens. I will clean this up first and investigate other tokenizers afterwards.
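For the cleanup itself, this is roughly what I have in mind, as a sketch (a hypothetical helper, not part of Rasa or EDA, and it only handles the one pattern I know is in my data):

```python
import re

def fix_split_contractions(text):
    # Re-join the "I m" artifacts that the augmentation left behind;
    # extend the substitution for other contractions you find in your data.
    return re.sub(r"\bI m\b", "I'm", text)

print(fix_split_contractions("Hi I m Jane"))
# Hi I'm Jane
```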

I am still a bit in the dark about whether the bot will learn to treat “I’m” and “Im” as equal. Unfortunately the augmentation I use (EDA) really messes with the data quite a bit; it even introduces noise such as reversing the meaning of some training examples (I’m not sure if that’s good or bad, though I suppose it should be good to a degree). It has, however, drastically improved my results given the small amount of training data I have, so I’m keeping it for now.

Good find! There’s a small gotcha there: the whitespace tokenizer will be used for entity recognition, but the CountVectorsFeaturizer does its own tokenization (you can choose between word and character n-grams, word boundaries, etc.).
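To illustrate the word vs. character n-gram point, here’s a sketch using plain scikit-learn (assuming, as above, that the featurizer behaves like scikit-learn’s CountVectorizer; the exact option names in the Rasa config may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I'm here", "Im here"]

# Word-level features: "i'm" and "im" end up as two unrelated vocabulary entries
word_vec = CountVectorizer(token_pattern=r"(?u)\b\w[\w']*\b").fit(texts)
print(sorted(word_vec.vocabulary_))
# ['here', "i'm", 'im']

# Character n-grams within word boundaries: the two spellings now share features
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(texts)
shared = set(char_vec.build_analyzer()("i'm")) & set(char_vec.build_analyzer()("im"))
print(shared)
# n-grams such as ' i' and 'm ' appear in both spellings
```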

The featurizer will not treat I’m and Im as equivalent inputs, so you want to make sure you have both in your training data so that your model learns to treat them the same.
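One pragmatic way to do that is to generate the apostrophe-free variant of every training example automatically, so both spellings always end up in the data. This is just a sketch of a hypothetical helper, not part of Rasa or EDA:

```python
APOSTROPHES = ("'", "\u2019")  # straight and curly apostrophes are different characters

def add_apostrophe_variants(examples):
    """For each example that contains an apostrophe, also emit the
    apostrophe-free variant so the model sees both spellings."""
    augmented = list(examples)
    for text in examples:
        stripped = text
        for ch in APOSTROPHES:
            stripped = stripped.replace(ch, "")
        if stripped != text:
            augmented.append(stripped)
    return augmented

print(add_apostrophe_variants(["I'm Jane", "hello there"]))
# ["I'm Jane", 'hello there', 'Im Jane']
```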
