DIETClassifier with sparse input features only

I just saw that the default NLU config is a DIETClassifier with two CountVectorsFeaturizer components, one for word-level features and one for character n-gram features. As far as I understood, the DIET architecture relies on pretrained embeddings. I am having a hard time imagining this architecture with sparse bag-of-words input features only. Can you please point me in the right direction on how this works?

Thank you

Found my answers. Everything is perfectly explained in this video.

@koaning Just a quick question about the DIET approach when it is used with sparse features only. I can see that the default config is a DIETClassifier with two CountVectorsFeaturizer components, one for word-level features and one for character n-gram features. I can also see that the default for the DIETClassifier is number_of_transformer_layers=2.

So here is my question: Does it really make sense to use transformer layers inside the DIETClassifier, when dealing with sparse features only? How can a transformer even work without being able to grab pretrained word embeddings?

Let’s have a look at DIET.

Now, let’s remove the pre-trained embeddings since you mention we’re only using sparse features. I will also remove the masking part since that’s probably not relevant either.

Let us now consider what is happening at the bottom, right after the sparse features. Between the sparse features and the transformer, we are applying two feed-forward layers.

These feed-forward layers … are causing the sparse features to be turned into embeddings. These embeddings will represent (sub)tokens and they are trained on the labels provided by the system (intents and entities). These embeddings are also plain vectors, just like vectors from word embeddings.

So why are transformers still a good idea here? It is because the attention from the perspective of the __CLS__ token might still be more directly influenced by separate word tokens. DIET, internally, is making embeddings too! It’s just that we typically do not expose them.

For more details on this you might appreciate this video on the topic.


Wow. This is really interesting. Thanks for the quick response. Just to make sure I understood everything correctly:

The vector containing the sparse features (circled in orange) is exactly the same for each token (["play", "ping", "pong"]) in the sequence. We repeat the same sparse input features for each token. Is the intuition behind this that the FF layers take care of converting the same input into something like a pretrained embedding for each token, which can then be contextualized by the following transformer layers?

Thanks for the video link. I am a big fan of the whole playlist.

Assuming we’re using no subwords, the mental picture is similar to this:

Note that the sparse representation for the entire utterance can be interpreted as the sum of the separate tokens.

Let’s zoom in on a sparse encoding followed by a single embedding layer.


When we have a ‘1’ input then the weights from the feedforward layer matter. Otherwise, we multiply a weight times zero which always equals zero.
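
To make this concrete, here is a minimal pure-Python sketch of a single linear layer applied to a one-hot vector. The vocabulary, the weight values, and the helper names `one_hot` and `dense` are made up for illustration and are not Rasa's implementation; bias and activation are left out on purpose.

```python
# Toy vocabulary and a hypothetical 3x2 weight matrix
# (one row per vocabulary entry, two embedding dimensions).
vocab = ["play", "ping", "pong"]
weights = [
    [0.1, 0.2],   # row for "play"
    [0.3, -0.4],  # row for "ping"
    [-0.5, 0.6],  # row for "pong"
]


def one_hot(token):
    """Sparse (one-hot) feature vector for a single token."""
    return [1.0 if word == token else 0.0 for word in vocab]


def dense(x, w):
    """Linear layer without bias or activation: computes x @ w."""
    return [sum(x_i * row[j] for x_i, row in zip(x, w)) for j in range(len(w[0]))]


# Only the row that lines up with the '1' survives;
# every other row is multiplied by zero.
embedding = dense(one_hot("ping"), weights)
print(embedding)
```

The result is simply the weight row that belongs to "ping", which is why this layer behaves like an embedding lookup.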


If we now had a sparse input for a whole sentence, more weights would matter and thus the output embedding would be different.

So by merit of linear algebra, the dense representation of the sparse sentence features can also be interpreted as the sum of the dense representations of the separate tokens. Note that in these diagrams I’m only looking at the first embedding layer that is applied. Also, I’m ignoring any activations that theoretically could be in there.
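
A small sketch of that linear-algebra argument, under stated assumptions: a toy vocabulary, invented weights, and a single linear layer with no bias and no activation. `one_hot` and `dense` are hypothetical helpers for illustration, not Rasa code.

```python
vocab = ["play", "ping", "pong"]

# Invented weights: one row per vocabulary entry.
weights = [
    [0.5, 1.0],    # "play"
    [0.25, -2.0],  # "ping"
    [-1.5, 0.75],  # "pong"
]


def one_hot(token):
    """Sparse (one-hot) feature vector for a single token."""
    return [1.0 if word == token else 0.0 for word in vocab]


def dense(x, w):
    """Linear layer without bias or activation: computes x @ w."""
    return [sum(x_i * row[j] for x_i, row in zip(x, w)) for j in range(len(w[0]))]


def add(vectors):
    """Element-wise sum of a list of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


tokens = ["play", "ping", "pong"]

# Route 1: embed every token separately, then sum the embeddings.
sum_of_embeddings = add([dense(one_hot(t), weights) for t in tokens])

# Route 2: sum the sparse vectors first (sentence-level features), then embed.
sentence_features = add([one_hot(t) for t in tokens])
sentence_embedding = dense(sentence_features, weights)

print(sum_of_embeddings, sentence_embedding)
```

Both routes give the same vector because the layer is linear. As soon as a bias or an activation is added, the equality is no longer guaranteed.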

Thanks again. This means my intuition was completely wrong. lol

So the sparse feature vector for ["play", "ping", "pong"] is the corresponding one-hot encoded vector.

The sparse feature vector for the __CLS__ token is the resultant sum of these one-hot encoded vectors.

I can see how this, combined with the FF layers, works like an embedding lookup. So I suppose that the weights of the FF layers are shared across tokens.


Be aware though that this is not generally the case for dense BERT-style embeddings. These embeddings have an attention mechanism that is strictly non-linear.
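
A toy illustration of why a non-linearity breaks the sum interpretation. ReLU is used here only as a simple stand-in for a non-linear step; it is not the attention mechanism, and the token vectors are invented.

```python
def relu(vector):
    """Element-wise ReLU, used as a stand-in for any non-linear step."""
    return [max(0.0, x) for x in vector]


def add(vectors):
    """Element-wise sum of a list of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


# Hypothetical dense vectors for three tokens (made-up numbers).
token_vectors = [[0.5, 1.0], [0.25, -2.0], [-1.5, 0.75]]

# Apply the non-linearity to the summed vector ...
activation_of_sum = relu(add(token_vectors))

# ... versus summing the individually activated vectors.
sum_of_activations = add([relu(v) for v in token_vectors])

print(activation_of_sum, sum_of_activations)
```

The two results differ, so once a non-linear step is involved the sentence vector can no longer be read as the plain sum of the token vectors.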

You built something really cool! I am looking forward to trying it out and comparing it to more classic approaches.

Just tagging @Ghostvv, @amn41, @dakshvar22 and @Tanja for the good vibes (I didn’t design DIET, I merely explain it 😉)