I just saw that the default config for NLU is a DIETClassifier with two CountVectorsFeaturizers, one for word-level features and one for character n-gram features. As far as I understood, the DIET architecture relies on pretrained embeddings. I am having a hard time imagining this architecture with sparse bag-of-words input features only. Can you please point me in the right direction on how this works?
@koaning Just a quick question concerning the DIET approach, when used with sparse features only.
I can see that the default config is a DIETClassifier with two CountVectorsFeaturizers, one for word-level features and one for character n-gram features. I can also see that the default for the DIETClassifier is number_of_transformer_layers=2.
So here is my question: Does it really make sense to use transformer layers inside the DIETClassifier when dealing with sparse features only? How can a transformer even work without being able to grab pretrained word embeddings?
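For reference, here is roughly the kind of sparse output I understand those two featurizers to produce. This is just a sketch using scikit-learn's CountVectorizer (which, as far as I know, is what backs the CountVectorsFeaturizer); the exact analyzer settings here are my assumption, not necessarily the pipeline defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ["play ping pong", "book a table"]

# Word-level bag of words: one count per known word.
word_vectorizer = CountVectorizer(analyzer="word")
word_features = word_vectorizer.fit_transform(messages)

# Character n-grams within word boundaries (assumed range 1-4 here),
# analogous to the second CountVectorsFeaturizer in the default config.
char_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
char_features = char_vectorizer.fit_transform(messages)

print(word_features.shape)  # (2, vocab_size) sparse count matrix
print(char_features.shape)  # (2, n_char_ngrams) sparse count matrix
```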
Let us now consider what is happening in the bottom two feed-forward layers, right after the sparse features. Between the sparse features and the transformer, we apply two feed-forward layers.
These feed-forward layers turn the sparse features into embeddings. These embeddings represent (sub)tokens and they are trained on the labels provided by the system (intents and entities). These embeddings are also plain dense vectors, just like the vectors from pretrained word embeddings.
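A minimal numpy sketch of that idea (not the actual DIET code, and the sizes are made up): a shared weight matrix maps each (sub)token's sparse count vector to a dense vector, and it is this weight matrix that gets trained against the intent/entity labels.

```python
import numpy as np

n_sparse_features, embed_dim = 1000, 64
rng = np.random.default_rng(0)

# Weights of the feed-forward layer; in DIET these would be learned
# from the labels, here they are just random for illustration.
W = rng.normal(scale=0.1, size=(n_sparse_features, embed_dim))

# Sparse bag-of-words vector for one (sub)token: mostly zeros, with
# counts at the indices of the active word / char n-gram features.
sparse_token = np.zeros(n_sparse_features)
sparse_token[[17, 42]] = 1.0

dense_embedding = sparse_token @ W  # a plain dense vector, shape (64,)
print(dense_embedding.shape)
```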
So why are transformers still a good idea here? It is because, via attention, the representation of the __CLS__ token might still be more directly influenced by the individual word tokens. DIET, internally, is making embeddings too! It’s just that we typically do not expose them.
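To make the attention point a bit more concrete, here is a toy single-head attention step from the __CLS__ position, in plain numpy. This illustrates scaled dot-product attention in general, not DIET's actual transformer code, and all weights are random.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

embed_dim = 8
rng = np.random.default_rng(1)

# Dense embeddings coming out of the feed-forward layers:
# one per token plus one for __CLS__.
tokens = {"__CLS__": rng.normal(size=embed_dim),
          "play": rng.normal(size=embed_dim),
          "ping": rng.normal(size=embed_dim),
          "pong": rng.normal(size=embed_dim)}

names = list(tokens)
X = np.stack([tokens[n] for n in names])           # (4, embed_dim)

# Single attention head with randomly initialised projections.
Wq, Wk, Wv = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))
q_cls = tokens["__CLS__"] @ Wq                     # query for __CLS__
keys, values = X @ Wk, X @ Wv

weights = softmax(q_cls @ keys.T / np.sqrt(embed_dim))
cls_out = weights @ values                         # contextualised __CLS__

print(dict(zip(names, weights.round(2))))          # how much __CLS__ attends to each token
```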
The vector containing the sparse features (circled in orange) is exactly the same for each token (["play", "ping", "pong"]) in the sequence. We repeat the same sparse input features for each token. Is the intuition behind this that the FF layers take care of converting the same input into something like a pretrained embedding for each token, which can then be contextualized by the following transformer layers?
Thanks for the video link. I am a big fan of the whole playlist.
So by merit of linear algebra, the dense representation of the summed sparse features can also be interpreted as the sum of the individual dense representations. Note that in these diagrams I’m only looking at the first embedding layer that is applied. Also, I’m ignoring any activations that could theoretically be in there.
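If you want to convince yourself of the linearity argument, here is a quick numpy check (made-up sizes, random weights): embedding the sum of two sparse vectors gives the same result as summing their individual embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(1000, 64))          # first embedding layer, no activation

x_play = np.zeros(1000)
x_play[17] = 1.0                         # sparse features for "play"
x_pong = np.zeros(1000)
x_pong[42] = 1.0                         # sparse features for "pong"

lhs = (x_play + x_pong) @ W              # embed the summed sparse features
rhs = x_play @ W + x_pong @ W            # sum the individual embeddings

print(np.allclose(lhs, rhs))             # True
```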
Be aware though that this is not generally the case for dense BERT-style embeddings. Those embeddings are produced by an attention mechanism, which is strictly non-linear.