Features for the SKLearnIntentClassifier

Hi, I'm trying to understand how the SKLearnIntentClassifier works exactly, and I couldn't find a resource that explains which features the intent classifier uses.

What surprised me was the intent classification of messages with only a single token that does not have a word vector (and may even be OOV). Extra points for anyone who can tell me why there are lots of words in the vocab that don't have vectors.

For example, a simple greeting in German is ‘Hallo’, which should be recognized as a greet intent. Another way to say ‘Hallo’ is ‘Moin’. In contrast to ‘Hallo’, ‘Moin’ does not have a word vector. If ‘Moin’ is not part of my training data for greet, the intent is misclassified, but when I add the exact word to the training data, it gets classified correctly with high confidence, even though it is still an OOV word. Other ways to say ‘Hallo’ still get classified poorly. So which features does the intent classifier rely on in this case? Is there a feature for exact matching with a training example?
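If I understand the pipeline correctly, the SKLearnIntentClassifier never sees the tokens themselves; it is trained on a dense sentence vector, which in a spaCy pipeline is essentially the mean of the token vectors. Here is a minimal sketch of that featurization with made-up toy vectors (the values and the `VECTORS` table are assumptions, not real spaCy output) — it shows why a single OOV token produces an all-zero sentence vector, which the classifier can then memorize as a training point even though no real vector exists:

```python
import numpy as np

# Toy word vectors standing in for a spaCy vocab (hypothetical values).
VECTORS = {
    "hallo": np.array([0.9, 0.1, 0.0]),
    "guten": np.array([0.5, 0.4, 0.1]),
    "tag":   np.array([0.4, 0.5, 0.2]),
}
DIM = 3

def sentence_vector(tokens):
    """Mean of the token vectors; an OOV token contributes a zero
    vector, which is what a large spaCy model returns for it."""
    vecs = [VECTORS.get(t.lower(), np.zeros(DIM)) for t in tokens]
    return np.mean(vecs, axis=0)

print(sentence_vector(["Hallo"]))  # the word's own vector
print(sentence_vector(["Moin"]))   # all zeros: OOV, nothing to generalize from
```

So ‘Moin’ added to the training data would give the classifier one labeled all-zero point to memorize, while other unseen greetings still map to vectors it has never trained on. That would explain the behavior without any exact-match feature.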

So far I have learned that small spaCy models don't use pretrained word vectors but generate them on the fly. As a result, any input gets a (probably unique) word vector, while a larger spaCy model returns an all-zero vector for an OOV token. So a smaller model should perform better if the training data contains a lot of OOV tokens.
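The observable difference between the two behaviors can be sketched like this. To be clear, this is not spaCy's actual algorithm (the small models compute context tensors with a neural network); the hash-based `small_model_vector` below is just a stand-in that reproduces the property I mean: every string gets some nonzero, repeatable vector, while the table lookup falls back to zeros for OOV:

```python
import hashlib
import numpy as np

DIM = 4

def small_model_vector(token):
    """Deterministic pseudo-vector from a hash of the token -- a
    stand-in for the small models' on-the-fly vectors (NOT spaCy's
    real implementation, only the visible behavior)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return np.frombuffer(digest[: DIM * 4], dtype=np.uint32).astype(float) / 2**32

def large_model_vector(token, table):
    """Lookup in a fixed vector table; OOV falls back to zeros,
    as a large model without a vector for the token would."""
    return table.get(token, np.zeros(DIM))

print(small_model_vector("Moin"))      # nonzero, identical on every call
print(large_model_vector("Moin", {}))  # zeros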

Probably even better overall? What is the advantage of the larger models? Do they generalize better for text messages that are not seen during training?

More results:

As stated here https://spacy.io/usage/spacy-101, the on-the-fly vectors are context-sensitive. So the vector for ‘cat’ is different in the sentences ‘This is a cat’ and ‘Here is a cat’, but the two vectors are similar.
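One way to put a number on "different but similar" is cosine similarity, which is also what similarity comparisons between vectors typically use. The two context vectors below are hypothetical values I made up for illustration; the point is that nearly parallel vectors score close to (but not exactly) 1.0:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical context vectors for 'cat' in two different sentences.
cat_in_sentence_a = np.array([0.70, 0.30, 0.10])
cat_in_sentence_b = np.array([0.68, 0.33, 0.12])

print(cosine(cat_in_sentence_a, cat_in_sentence_b))  # close to, but below, 1.0
```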