RASA ConveRT and Semantic Similarity issues


I have built a Rasa NLU model using the ConveRT featurizer, but I am facing a challenge in the scenarios below, where the sentences have different semantics:

Case 1:

  • I need access to ABCD

  • I do not want access to ABCD

The classification label is the same for both sentences.

Case 2:

  • What is ABCD -> meaning

  • Who is XYZ -> person

Classification result:

  • What is XYZ -> person

  • Who is ABCD -> meaning

Classification is happening based on the entity, not on the semantics. Please help with handling the above scenarios.


Hi @Anand_Menon,

not sure I fully understand the scenario.

The examples you listed for case 1 should have the same intent. They have no entities. Is that correct? Does the model correctly identify the intents?

The examples you listed in case 2 should have the same intent but different entities. Is that correct? What is currently happening?

Hi @Tanja,

My query has nothing to do with entity extraction, which seems to work just fine. The issue is the effect of entities on intent classification and the model’s inability to understand the semantic meaning of a sentence.

Case 1:

I need access to ABCD -> ABCD is the entity and it is extracted correctly

  • I need access to ABCD -> intent: get_access
  • I do not need access to ABCD -> intent: remove_access

These are two totally different intents: the first is about getting access, the second about removing it. But the classifier always classifies both as get_access with very high confidence, even after I added training data for the remove-access cases. It seems to be an issue with semantics: the model is not able to distinguish sentences based on their semantic differences.
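For context, the contrastive examples I added look roughly like this (Rasa 1.x Markdown NLU format; the entity name `system` and the exact phrasings are just illustrative):

```md
## intent:get_access
- I need access to [ABCD](system)
- please give me access to [ABCD](system)
- can you grant me access to [ABCD](system)

## intent:remove_access
- I do not need access to [ABCD](system)
- remove my access to [ABCD](system)
- please revoke my access to [ABCD](system)
```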

Case 2:

Let’s take an example

  • what is football? Ans : Football is a sport (intent : meaning)
  • who is Beckham? Ans: David Beckham is a football legend (intent: who)

Now if my client ask a question like

  • what is Beckham? (intent: who) Ans: David Beckham is a football legend
  • who is Football? (intent: meaning) Ans: Football is a sport

The above questions are logically wrong, but the model still returns confident results: the weight of entity words like Beckham and football creates a bias towards mapping these sentences to the wrong intents.
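To make the problem concrete: if the entity words were ignored, the two questions would reduce to clearly distinguishable frames, “what is X” vs “who is X”. A rough plain-Python sketch of that idea (nothing to do with Rasa’s internals; all names here are hypothetical):

```python
# Hypothetical entity masking: replace known entity values with a
# placeholder so a classifier would see only the question frame,
# not the entity word itself.
KNOWN_ENTITIES = {"football", "beckham"}

def mask_entities(text: str) -> str:
    # Lowercase, strip a trailing "?", and swap known entities for a token.
    tokens = text.lower().rstrip("?").split()
    masked = ["__entity__" if t in KNOWN_ENTITIES else t for t in tokens]
    return " ".join(masked)

print(mask_entities("what is Beckham?"))  # -> "what is __entity__"
print(mask_entities("who is Football?"))  # -> "who is __entity__"
```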

Is there any possible way to tackle such issues?

I have tried Google’s universal sentence encoders for semantic similarity, but still no luck. I hope my explanation is clear enough.


Thanks for the detailed explanations.

What pipeline are you currently using? Did you try different kinds of pipelines?

Case 1 is a difficult one, and I’m afraid we don’t have a magic trick that solves this issue. We usually recommend adding more training data in that case, but as you already tried that and it did not work, it seems not to be an option. You could try out different model settings. Did you already try the DIETClassifier we released in 1.8.0?

Regarding Case 2: Do you have lookup tables in your training data? I’m asking because if you have lookup tables and the RegexFeaturizer in your pipeline, those features are used for intent classification (at least by some components). So, again, it would be good to know what your pipeline actually looks like.
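For reference, this is what I mean by a lookup table in the training data (Rasa 1.x Markdown format; the entity name and values are just examples taken from your post):

```md
## lookup:system
- ABCD
- XYZ
```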


Thanks for the quick response.

My current pipeline is as follows:

  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: EntitySynonymMapper
  - name: CountVectorsFeaturizer
  - name: ConveRTFeaturizer
  - name: EmbeddingIntentClassifier

Yeah, I know case 1 is a hard scenario to tackle and I was basically looking for suggestions. Thanks for the DIETClassifier pointer, I will definitely give it a try.

Regarding case 2: Currently I am not using lookup tables, but the RegexFeaturizer is a critical component in our use case. The sample questions I asked have nothing to do with the RegexFeaturizer.

Is there any tuning that needs to be done on the above pipeline?


The pipeline looks good, but I would assume you would achieve better results with the following pipeline:

  - name: ConveRTTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper

Note: You need to update to Rasa 1.8.1 for that. It might help with both scenarios. Just be aware that training may take longer.

Regarding the RegexFeaturizer: I just wanted to check why so much weight is given to the proper nouns in Case 2. What kind of regexes are you using?

Thanks for the above pipeline, I will give it a try and let you know about the results. I am currently using regexes to identify and extract basic info like credit card numbers, user IDs, phone numbers, emails, etc.
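Roughly the kind of patterns I mean (illustrative sketches only, not our actual production regexes), checked here with plain Python:

```python
import re

# Illustrative patterns; the real ones are more involved.
PATTERNS = {
    "credit_card": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b",
    "phone_number": r"\b\d{10}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def extract(text: str) -> dict:
    """Return the first match for each pattern found in the text."""
    return {
        name: m.group()
        for name, pat in PATTERNS.items()
        if (m := re.search(pat, text))
    }

print(extract("Reach me at 9876543210 or jane.doe@example.com"))
```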