DIETClassifier with sparse input features only

Assuming we’re not using subwords, the mental picture looks something like this:

Note that the sparse representation of the entire utterance can be interpreted as the sum of the sparse representations of the separate tokens.
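To make that concrete, here’s a minimal numpy sketch (not Rasa’s actual featurizer; the three-word vocabulary is made up) of token one-hots summing into a sentence vector:

```python
import numpy as np

vocab = ["hello", "there", "world"]  # hypothetical 3-word vocabulary

def one_hot(token):
    # Sparse (one-hot) representation of a single token.
    vec = np.zeros(len(vocab))
    vec[vocab.index(token)] = 1.0
    return vec

tokens = ["hello", "world"]
sentence_vec = sum(one_hot(t) for t in tokens)
print(sentence_vec)  # [1. 0. 1.] -- the sum of the token one-hots
```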

Let’s zoom in on a sparse encoding followed by a single embedding layer.

[image: sparse encoding followed by a single embedding layer]

When an input is ‘1’, the weights connecting it to the feedforward layer contribute to the output. Otherwise, each of its weights is multiplied by zero, which always yields zero.

[image: the weights that matter for a single ‘1’ input]
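Here’s a small numpy sketch of that effect, assuming a hypothetical 3-dimensional input and a 2-dimensional embedding; with a one-hot input, the matrix product simply selects one row of the weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))    # made-up weights: 3-dim input -> 2-dim embedding

x = np.array([0.0, 1.0, 0.0])  # one-hot vector for the second token
print(x @ W)                   # identical to W[1]: all other weights drop out
print(W[1])
```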

If we now feed in the sparse representation of a whole sentence, more weights come into play, and the resulting output embedding differs accordingly.

So by merit of linear algebra, the dense representation of the summed sparse vectors can also be interpreted as the sum of the dense representations of the individual tokens. Note that in these diagrams I’m only looking at the first embedding layer that is applied. I’m also ignoring any activations that could theoretically be in there.
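Since we’re ignoring activations, the layer is purely linear, and this claim can be checked in a couple of lines of numpy (again just a sketch with made-up weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))     # made-up embedding weights

x1 = np.array([1.0, 0.0, 0.0])  # one-hot for token 1
x2 = np.array([0.0, 0.0, 1.0])  # one-hot for token 2

# Embedding the summed sparse vectors equals summing the token embeddings.
print(np.allclose((x1 + x2) @ W, x1 @ W + x2 @ W))  # True
```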
