I just saw that the default config for NLU is a DIETClassifier with two CountVectorsFeaturizers, one for word-level features and one for character n-gram features. As far as I understood, the DIET architecture relies on pretrained embeddings. I am having a hard time imagining this architecture with sparse bag-of-words input features only. Can you please point me in the right direction on how this works?
@koaning Just a quick question concerning the DIET approach, when used with sparse features only.
I can see that the default config is a DIETClassifier with two CountVectorsFeaturizers, one for word-level features and one for character n-gram features. I can also see that the default for the DIETClassifier is number_of_transformer_layers=2.
So here is my question: Does it really make sense to use transformer layers inside the DIETClassifier when dealing with sparse features only? How can a transformer even work without being able to grab pretrained word embeddings?
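For reference, here is roughly the kind of sparse output I understand those two featurizers to produce. This is just a sketch using scikit-learn's CountVectorizer (which, as far as I know, is what backs the CountVectorsFeaturizer); the exact analyzer settings here are my assumption, not necessarily the pipeline defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ["play ping pong", "book a table"]

# Word-level bag of words: one count per known word.
word_vectorizer = CountVectorizer(analyzer="word")
word_features = word_vectorizer.fit_transform(messages)

# Character n-grams within word boundaries (assumed range 1-4 here),
# analogous to the second CountVectorsFeaturizer in the default config.
char_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
char_features = char_vectorizer.fit_transform(messages)

print(word_features.shape)  # (2, vocab_size) sparse count matrix
print(char_features.shape)  # (2, n_char_ngrams) sparse count matrix
```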
Let us now consider what is happening in the bottom two feed-forward layers, right after the sparse features. Between the sparse features and the transformer, we apply two feed-forward layers.
These feed-forward layers turn the sparse features into embeddings. These embeddings represent (sub)tokens and they are trained on the labels provided by the system (intents and entities). These embeddings are also plain dense vectors, just like the vectors from pretrained word embeddings.
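A minimal numpy sketch of that idea (not the actual DIET code, and the sizes are made up): a shared weight matrix maps each (sub)token's sparse count vector to a dense vector, and it is this weight matrix that gets trained against the intent/entity labels.

```python
import numpy as np

n_sparse_features, embed_dim = 1000, 64
rng = np.random.default_rng(0)

# Weights of the feed-forward layer; in DIET these would be learned
# from the labels, here they are just random for illustration.
W = rng.normal(scale=0.1, size=(n_sparse_features, embed_dim))

# Sparse bag-of-words vector for one (sub)token: mostly zeros, with
# counts at the indices of the active word / char n-gram features.
sparse_token = np.zeros(n_sparse_features)
sparse_token[[17, 42]] = 1.0

dense_embedding = sparse_token @ W  # a plain dense vector, shape (64,)
print(dense_embedding.shape)
```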
So why are transformers still a good idea here? It is because, via attention, the representation of the __CLS__ token might still be more directly influenced by the individual word tokens. DIET, internally, is making embeddings too! It’s just that we typically do not expose them.
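To make the attention point a bit more concrete, here is a toy single-head attention step from the __CLS__ position, in plain numpy. This illustrates scaled dot-product attention in general, not DIET's actual transformer code, and all weights are random.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

embed_dim = 8
rng = np.random.default_rng(1)

# Dense embeddings coming out of the feed-forward layers:
# one per token plus one for __CLS__.
tokens = {"__CLS__": rng.normal(size=embed_dim),
          "play": rng.normal(size=embed_dim),
          "ping": rng.normal(size=embed_dim),
          "pong": rng.normal(size=embed_dim)}

names = list(tokens)
X = np.stack([tokens[n] for n in names])           # (4, embed_dim)

# Single attention head with randomly initialised projections.
Wq, Wk, Wv = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))
q_cls = tokens["__CLS__"] @ Wq                     # query for __CLS__
keys, values = X @ Wk, X @ Wv

weights = softmax(q_cls @ keys.T / np.sqrt(embed_dim))
cls_out = weights @ values                         # contextualised __CLS__

print(dict(zip(names, weights.round(2))))          # how much __CLS__ attends to each token
```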
The vector containing the sparse features (circled in orange) is exactly the same for each token (["play", "ping", "pong"]) in the sequence. We repeat the same sparse input features for each token. Is the intuition behind this that the FF layers take care of converting the same input into something like a pretrained embedding for each token, which can then be contextualized by the following transformer layers?
Thanks for the video link. I am a big fan of the whole playlist.
So by merit of linear algebra, the dense representation of the summed sparse features can also be interpreted as the sum of the individual dense representations. Note that in these diagrams I’m only looking at the first embedding layer that is applied. Also, I’m ignoring any activations that could theoretically be in there.
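If you want to convince yourself of the linearity argument, here is a quick numpy check (made-up sizes, random weights): embedding the sum of two sparse vectors gives the same result as summing their individual embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(1000, 64))          # first embedding layer, no activation

x_play = np.zeros(1000)
x_play[17] = 1.0                         # sparse features for "play"
x_pong = np.zeros(1000)
x_pong[42] = 1.0                         # sparse features for "pong"

lhs = (x_play + x_pong) @ W              # embed the summed sparse features
rhs = x_play @ W + x_pong @ W            # sum the individual embeddings

print(np.allclose(lhs, rhs))             # True
```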
Be aware though that this is not generally the case for dense BERT-style embeddings. Those embeddings are produced by an attention mechanism, which is strictly non-linear.