Confidence Score Computations

Hi Everyone.

I am trying to understand what the confidence scores output by Rasa NLU actually are and how they are computed.

I have been working on an intent classification task with the TensorFlow embedding classifier. Once my model is trained and I parse new/test data, I receive a confidence score along with each probable intent, but I have little to no idea what this confidence score actually represents.

As mentioned in the docs, it does not represent a probability. After observing some results, it seems to be a one-to-many evaluation, i.e. for a single text input I can get multiple intents with high confidence scores.
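
To illustrate the observation above, here is a minimal sketch in plain Python. It is not Rasa code, and the vectors and intent names are made up. It contrasts per-intent similarity scores, which are computed independently of each other and so need not sum to 1, with a softmax over the same scores, which does behave like a probability distribution.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Turn arbitrary scores into values that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings, chosen only for illustration.
text_embedding = [1.0, 1.0, 0.0]
intent_embeddings = {
    "greet": [1.0, 0.9, 0.1],
    "chitchat": [0.9, 1.0, 0.0],
    "goodbye": [-1.0, 0.0, 1.0],
}

similarities = {name: cosine(text_embedding, vec)
                for name, vec in intent_embeddings.items()}
probabilities = softmax(list(similarities.values()))

print(similarities)        # two intents score above 0.9 at the same time
print(sum(probabilities))  # sums to 1, up to floating-point rounding
```

If the confidence really is a similarity of this kind, several intents scoring high for one input is expected behaviour rather than a bug.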

After a quick look at the code, I think it is computed in the “_tf_sim” function in the “” file (relevant code segment below).

Can somebody please confirm or clarify how the confidence score is computed and what it means here?

def _tf_sim(self, a, b):
    """Define similarity"""

    if self.similarity_type == 'cosine':
        a = tf.nn.l2_normalize(a, -1)
        b = tf.nn.l2_normalize(b, -1)

    if self.similarity_type == 'cosine' or self.similarity_type == 'inner':
        sim = tf.reduce_sum(tf.expand_dims(a, 1) * b, -1)

        # similarity between intent embeddings
        sim_emb = tf.reduce_sum(b[:, 0:1, :] * b[:, 1:, :], -1)

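        return sim, sim_emb

    # Only reached when similarity_type is neither 'cosine' nor 'inner'
    raise ValueError("Wrong similarity type {}, "
                     "should be 'cosine' or 'inner'"
                     "".format(self.similarity_type))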
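
For readers who do not have TensorFlow at hand, the `sim` part of the function above can be approximated in plain Python. This is only a sketch under assumptions: `similarity`, `l2_normalize`, and the toy vectors are illustrative names and values, not part of Rasa, and batching and `sim_emb` are left out.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def similarity(a, candidates, similarity_type="cosine"):
    """Score one text embedding `a` against each candidate intent embedding."""
    if similarity_type not in ("cosine", "inner"):
        raise ValueError("Wrong similarity type {}, "
                         "should be 'cosine' or 'inner'".format(similarity_type))
    if similarity_type == "cosine":
        # Normalising first turns the dot product into a cosine similarity.
        a = l2_normalize(a)
        candidates = [l2_normalize(b) for b in candidates]
    # Dot product of `a` with every candidate, like reduce_sum(a * b, -1).
    return [sum(x * y for x, y in zip(a, b)) for b in candidates]

a = [3.0, 4.0]
b = [[3.0, 4.0], [4.0, -3.0], [-3.0, -4.0]]
print(similarity(a, b, "cosine"))
print(similarity(a, b, "inner"))
```

With 'cosine' every score lies between -1 and 1, while 'inner' is unbounded and grows with vector length. Neither is normalised across the candidates, so neither is a probability.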

Are you aware of what the pipeline for the embedding classifier actually does?

In short, and I may not be 100% correct:

Step 1 - Tokenisation of your training data

Step 2 - Featurization - The pipeline builds a word embedding in a high-dimensional space, or put simply assigns a vector to each word using the bag-of-words approach, so that words that are likely similar end up closer to each other. This distance is measured as the cosine distance between the two vectors. The same approach is taken for the intents as well.

Step 3 - Fit - The embeddings are then fed into a non-linear classifier to find the best possible classes. So when a new sentence, such as one from your test set, is given to the model, it tries to find the similarity (cosine distance) between that sentence and the learned intent embeddings.
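
The three steps above can be sketched roughly in plain Python. This is a deliberately simplified illustration, not what Rasa does internally: it compares raw bag-of-words count vectors directly instead of training embeddings, and the example sentences, intent names, and helper functions are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Step 1: naive whitespace tokenisation (lower-cased)."""
    return text.lower().split()

def bag_of_words(tokens, vocabulary):
    """Step 2: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical training examples, one per intent.
examples = {
    "greet": "hello there good morning",
    "goodbye": "bye see you later",
}
vocabulary = sorted({w for text in examples.values() for w in tokenize(text)})

def rank_intents(text):
    """Step 3 (simplified): rank intents by similarity to the new sentence."""
    query = bag_of_words(tokenize(text), vocabulary)
    scores = {intent: cosine(query, bag_of_words(tokenize(example), vocabulary))
              for intent, example in examples.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank_intents("good morning"))
```

In the real classifier the vectors being compared are learned during training, so the scores depend on the training data rather than on word overlap alone.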


Hey @souvikg10, Thanks for your response.

It was quite helpful. Based on your description, I gather that the confidence score is basically a similarity score/metric that gives the similarity of the input text (‘a’ in the code above) to the embedding for a certain class (‘b’ in the code above), or more explicitly, the embedding mapping from the utterance to some class.

Could you please confirm my understanding of this?

Indeed. I would also say that you should run some evaluation using cross-validation to get an F1 score for your training set and check for overfitting, since this pipeline can also overfit.
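
As a rough idea of what such a check involves, here is a small, library-free sketch of k-fold cross-validation with a macro-averaged F1 score. It is not Rasa's own evaluation tooling. `train_and_predict` is a hypothetical callable that you would have to supply yourself, wrapping whatever trains your model and returns predicted intents.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-intent F1 scores."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for label in sorted(set(true_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def k_fold_indices(n_examples, k):
    """Yield (train, test) index lists; fold i holds out every k-th example."""
    for fold in range(k):
        test = list(range(fold, n_examples, k))
        held_out = set(test)
        train = [i for i in range(n_examples) if i not in held_out]
        yield train, test

def cross_validate(texts, labels, train_and_predict, k=3):
    """Average macro F1 over k folds.

    `train_and_predict(train_texts, train_labels, test_texts)` is a
    hypothetical callable returning one predicted label per test text.
    """
    fold_scores = []
    for train, test in k_fold_indices(len(texts), k):
        predictions = train_and_predict([texts[i] for i in train],
                                        [labels[i] for i in train],
                                        [texts[i] for i in test])
        fold_scores.append(macro_f1([labels[i] for i in test], predictions))
    return sum(fold_scores) / len(fold_scores)
```

A held-out F1 that is much lower than the F1 measured on the training data itself is the usual sign of overfitting.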

Thanks! I will take your advice into account for sure :)