Features for the SKLearnIntentClassifier

Hi, I'm trying to understand how the SKLearnIntentClassifier works exactly, and I couldn't find a resource that explains which features the intent classifier uses.

What surprised me was the intent classification of messages with only a single token that does not have a word vector (and may even be OOV). Extra points for anyone who can tell me why there are lots of words in the vocab that don't have vectors.

For example, a simple greeting in German is ‘Hallo’, which should be recognized as a greet intent. Another way to say ‘Hallo’ is ‘Moin’. In contrast to ‘Hallo’, ‘Moin’ does not have a word vector. If ‘Moin’ is not part of my training data for greet, the intent is misclassified, but when I add the exact word to the training data, it gets classified correctly with high confidence, even though it is still an OOV word. Other ways to say ‘Hallo’ still get classified poorly. So which features does the intent classifier rely on in this case? Is there a feature for exact matching with a training example?
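If I understand the pipeline correctly, the SKLearnIntentClassifier never sees the tokens themselves; it is trained on a dense sentence vector, which in a spaCy pipeline is essentially the mean of the token vectors. Here is a minimal sketch of that featurization with made-up toy vectors (the values and the `VECTORS` table are assumptions, not real spaCy output) — it shows why a single OOV token produces an all-zero sentence vector, which the classifier can then memorize as a training point even though no real vector exists:

```python
import numpy as np

# Toy word vectors standing in for a spaCy vocab (hypothetical values).
VECTORS = {
    "hallo": np.array([0.9, 0.1, 0.0]),
    "guten": np.array([0.5, 0.4, 0.1]),
    "tag":   np.array([0.4, 0.5, 0.2]),
}
DIM = 3

def sentence_vector(tokens):
    """Mean of the token vectors; an OOV token contributes a zero
    vector, which is what a large spaCy model returns for it."""
    vecs = [VECTORS.get(t.lower(), np.zeros(DIM)) for t in tokens]
    return np.mean(vecs, axis=0)

print(sentence_vector(["Hallo"]))  # the word's own vector
print(sentence_vector(["Moin"]))   # all zeros: OOV, nothing to generalize from
```

So ‘Moin’ added to the training data would give the classifier one labeled all-zero point to memorize, while other unseen greetings still map to vectors it has never trained on. That would explain the behavior without any exact-match feature.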

So far I have learned that small spaCy models don't use pretrained word vectors but generate them on the fly. As a result, any input gets a (probably unique) word vector, while a larger spaCy model returns an all-zero vector for an OOV token. So a smaller model should perform better if the training data contains a lot of OOV tokens.
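The observable difference between the two behaviors can be sketched like this. To be clear, this is not spaCy's actual algorithm (the small models compute context tensors with a neural network); the hash-based `small_model_vector` below is just a stand-in that reproduces the property I mean: every string gets some nonzero, repeatable vector, while the table lookup falls back to zeros for OOV:

```python
import hashlib
import numpy as np

DIM = 4

def small_model_vector(token):
    """Deterministic pseudo-vector from a hash of the token -- a
    stand-in for the small models' on-the-fly vectors (NOT spaCy's
    real implementation, only the visible behavior)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return np.frombuffer(digest[: DIM * 4], dtype=np.uint32).astype(float) / 2**32

def large_model_vector(token, table):
    """Lookup in a fixed vector table; OOV falls back to zeros,
    as a large model without a vector for the token would."""
    return table.get(token, np.zeros(DIM))

print(small_model_vector("Moin"))      # nonzero, identical on every call
print(large_model_vector("Moin", {}))  # zeros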

Probably even better overall? What is the advantage of the larger models? Do they generalize better for text messages that are not seen during training?

More results:

As stated here https://spacy.io/usage/spacy-101, the on-the-fly vectors are context-sensitive. So the vector for ‘cat’ is different in the sentences ‘This is a cat’ and ‘Here is a cat’, but the two vectors are similar.
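One way to put a number on "different but similar" is cosine similarity, which is also what similarity comparisons between vectors typically use. The two context vectors below are hypothetical values I made up for illustration; the point is that nearly parallel vectors score close to (but not exactly) 1.0:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical context vectors for 'cat' in two different sentences.
cat_in_sentence_a = np.array([0.70, 0.30, 0.10])
cat_in_sentence_b = np.array([0.68, 0.33, 0.12])

print(cosine(cat_in_sentence_a, cat_in_sentence_b))  # close to, but below, 1.0
```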