TED Architecture Question

Hello, this is less a problem I'm having with Rasa and more something that came to mind while I was working with TED. Does anyone have a sense of why we use a similarity metric (as in StarSpace) to compare actions, instead of directly predicting an action from the transformer/FF block? The only reason I can think of is that the action feature vector gets used elsewhere (in StarSpace it's nice to have the two objects in the same feature space), but here I'm not clear why a cosine similarity would be better than a softmax over the actions.
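
To make the comparison concrete, here is a minimal NumPy sketch of the two output heads I have in mind. It is not Rasa's actual TED code: every name, dimension, and weight matrix below is a hypothetical stand-in, the weights are random rather than trained, and `dialogue_state` just represents whatever vector comes out of the transformer/FF block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_ACTIONS, HIDDEN_DIM, EMBED_DIM, ACTION_FEAT_DIM = 5, 16, 8, 12


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def softmax_head(dialogue_state, w_out, b_out):
    """Option A: predict the action directly, one logit per action."""
    return softmax(dialogue_state @ w_out + b_out)


def similarity_head(dialogue_state, action_features, w_dialogue, w_action):
    """Option B: embed dialogue and actions into one space, score by cosine."""
    d = dialogue_state @ w_dialogue   # shape (EMBED_DIM,)
    a = action_features @ w_action    # shape (num_actions, EMBED_DIM)
    d = d / np.linalg.norm(d)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    return a @ d                      # one cosine similarity per action


# Stand-ins for the transformer/FF output and the per-action feature vectors.
dialogue_state = rng.normal(size=HIDDEN_DIM)
action_features = rng.normal(size=(NUM_ACTIONS, ACTION_FEAT_DIM))

# Untrained, randomly initialised weights for both heads.
w_out = rng.normal(size=(HIDDEN_DIM, NUM_ACTIONS))
b_out = np.zeros(NUM_ACTIONS)
w_dialogue = rng.normal(size=(HIDDEN_DIM, EMBED_DIM))
w_action = rng.normal(size=(ACTION_FEAT_DIM, EMBED_DIM))

probs = softmax_head(dialogue_state, w_out, b_out)
sims = similarity_head(dialogue_state, action_features, w_dialogue, w_action)

print("softmax prediction:   ", int(np.argmax(probs)))
print("similarity prediction:", int(np.argmax(sims)))
```

The one structural difference the sketch makes visible is that `w_out` has the number of actions baked into its shape, while `similarity_head` scores however many rows `action_features` happens to have. It only illustrates the shape of the two options and says nothing about which one trains better.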

