Let’s have a look at DIET.
Let’s start by removing the masking bit, since that’s probably not relevant here.
Let’s also remove the pre-trained embeddings, since you mention we’re only using sparse features.
Let us now consider what is happening at the bottom, right after the sparse features. Between the sparse features and the transformer, we are applying two feed-forward layers.
These feed-forward layers … turn the sparse features into embeddings. These embeddings represent (sub)tokens and they are trained on the labels provided by the system (intents and entities). They are also plain dense vectors, just like the vectors from word embeddings.
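To make that a bit more concrete, here is a minimal sketch of the idea in plain Python. It is not DIET's actual code: the vocabulary, the layer sizes, the random weights and the ReLU activation are all assumptions for illustration. In the real model the weights are learned from your intent and entity labels.

```python
import random

def sparse_features(token, vocab):
    """One-hot vector for a token: mostly zeros, hence 'sparse'."""
    return [1.0 if token == v else 0.0 for v in vocab]

def random_layer(n_in, n_out, rng):
    """Hypothetical untrained layer: random weights, zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, weights, biases):
    """One feed-forward layer with ReLU (assumed activation): relu(W x + b)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

rng = random.Random(0)
vocab = ["book", "a", "flight", "to", "berlin"]  # toy vocabulary
layer1 = random_layer(len(vocab), 8, rng)        # sparse (5) -> hidden (8)
layer2 = random_layer(8, 4, rng)                 # hidden (8) -> embedding (4)

def embed(token):
    """Pass sparse features through two feed-forward layers."""
    hidden = dense(sparse_features(token, vocab), *layer1)
    return dense(hidden, *layer2)

embedding = embed("flight")  # a small dense vector, i.e. an embedding
```

The input is a long vector of mostly zeros and the output is a short dense vector. That output is what gets passed on to the transformer.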
So why are transformers still a good idea here? It is because the attention from the perspective of the __CLS__ token might still be more directly influenced by separate word tokens. DIET, internally, is making embeddings too! It’s just that we typically do not expose them.
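As a rough illustration of what "attention from the perspective of the __CLS__ token" means, here is a small sketch of scaled dot-product attention with the __CLS__ vector as the query. The vectors are made-up numbers and the learned query/key projections of a real transformer are left out, so treat it as a simplification and not as how DIET computes things.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attention(cls_vec, token_vecs):
    """Attention weights of the CLS query over each token vector."""
    scale = math.sqrt(len(cls_vec))
    scores = [
        sum(q * k for q, k in zip(cls_vec, tok)) / scale
        for tok in token_vecs
    ]
    return softmax(scores)

tokens = ["book", "flight", "berlin"]
token_vecs = [[0.1, 0.0], [0.9, 0.4], [0.2, 1.0]]  # hypothetical token embeddings
cls_vec = [1.0, 0.5]                               # hypothetical CLS vector

weights = cls_attention(cls_vec, token_vecs)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

Each individual token gets its own weight, so a single word can pull directly on the __CLS__ representation that is used for the intent.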
For more details on this you might appreciate this video on the topic.