Discovering synonyms using pre-trained BERT models


The general idea of this topic is that we want to implement a custom component which uses BERT to discover new synonyms. Let's break it down a little:

We have created separate files for storing synonyms for each slot and entity, and populated those files with possible synonyms users may utter when talking to our chatbot. But adding new synonyms to those files by hand would become rather difficult as time passes, so we are wondering whether we could somehow automate this process.

The reason we want synonyms is that each slot and entity we have defined takes roughly 20-40 possible values. The chatbot we are developing is a closed-domain, retrieval-based chatbot for a bank, and mapping all the variations users could utter to their respective canonical values would make querying our database with exact matches very simple. For testing purposes we are using an SQLite database for storing our answers. The full-text-search capability of MongoDB has crossed our mind, but since there are completely different ways of referring to the same word, we have given up on that for now.

So, here is what we are thinking of:

  1. Create a custom component which loads a pre-trained BERT model:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')

This model would work the following way:

>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]

Now, the component we are trying to implement takes in the user message and the entities extracted by the DIET classifier. Then, using the start and end indices of each extracted entity, we replace the corresponding entity value with [MASK]. After this, we pass each masked sentence to the unmasker() function above to predict the masked word.
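The masking step could look something like the sketch below. The function name is ours, and the entity dicts follow the `start`/`end` character-index shape that DIET produces:

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> List[str]:
    """For each extracted entity, produce a copy of the message with
    that entity's span replaced by BERT's [MASK] token.

    Each entity dict carries 'start' and 'end' character indices,
    as in the DIET classifier's output.
    """
    masked = []
    # Mask entities one at a time so each sentence has exactly one [MASK].
    for ent in entities:
        masked.append(text[:ent["start"]] + "[MASK]" + text[ent["end"]:])
    return masked
```

For example, `mask_entities("I want to open a savings account", [{"start": 17, "end": 24}])` yields `["I want to open a [MASK] account"]`, which can then be fed to `unmasker()`.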

And here is the important part, or so we believe. The model predicts words which could be placed where [MASK] is such that the original sentence context is preserved. The words BERT predicts should therefore often be synonyms of the original words the user entered. Could we then perhaps write these predicted words to the synonyms.yml file of each slot and entity?
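Before writing anything to synonyms.yml, the fill-mask predictions would presumably need filtering, since not every high-context-fit word is a synonym. A minimal sketch (the function name and the 0.05 score threshold are our own guesses and would need tuning):

```python
from typing import Dict, List, Set


def collect_synonyms(
    predictions: List[Dict],
    original_value: str,
    known_synonyms: Set[str],
    min_score: float = 0.05,
) -> List[str]:
    """Filter fill-mask predictions into candidate synonyms.

    `predictions` is the list of dicts returned by the transformers
    fill-mask pipeline (each with 'token_str' and 'score' keys). Only
    tokens above `min_score` that are neither the original value nor
    already known are kept.
    """
    candidates = []
    for pred in predictions:
        token = pred["token_str"].strip()
        if (
            pred["score"] >= min_score
            and token != original_value
            and token not in known_synonyms
        ):
            candidates.append(token)
    return candidates
```

Running this on the example output above with `original_value="fashion"` and `known_synonyms={"role"}` would keep only "new", since "super" and "fine" fall below the threshold.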

After this we map the entity values extracted by the DIET classifier to their respective synonyms, then proceed to the ResponseSelector.
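The mapping step itself is essentially what EntitySynonymMapper already does with the contents of synonyms.yml; as a plain-Python sketch (the function name is ours):

```python
from typing import Dict


def map_to_canonical(value: str, synonym_map: Dict[str, str]) -> str:
    """Map an extracted entity value to its canonical form.

    `synonym_map` maps each known synonym (lowercased) to its canonical
    value, mirroring the synonyms.yml contents; values with no entry
    pass through unchanged.
    """
    return synonym_map.get(value.lower(), value)
```

With a map like `{"chequing": "checking account", "checking": "checking account"}`, both "Chequing" and "checking" resolve to "checking account", so the database query can use a single exact match.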

  2. In the config.yml file we add this component after the EntitySynonymMapper component. The reason is:
  • If the extracted slot/entity values already exist in the synonyms.yml file of each slot/entity, then we proceed to the ResponseSelector and do not activate the BERT component above.
  • Otherwise, apply what we described above.
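In config.yml terms, the pipeline order we have in mind would look roughly like this (the module path for the custom component is hypothetical, and the exact syntax depends on the Rasa version):

```
pipeline:
  # ... tokenizer and featurizers ...
  - name: DIETClassifier
  - name: EntitySynonymMapper
  - name: custom_components.BertSynonymSuggester  # hypothetical custom component
  - name: ResponseSelector
```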

This whole idea sounds plausible in our heads; however, actually implementing it is proving to be somewhat challenging.
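To make the intended flow concrete, here is a framework-agnostic sketch that ties the steps together (the class name is ours, and a real Rasa component would wrap this logic in the custom-component API; `unmasker` is anything behaving like the transformers fill-mask pipeline, i.e. taking a sentence containing [MASK] and returning dicts with 'token_str' and 'score' keys):

```python
from typing import Callable, Dict, List


class BertSynonymSuggester:
    """Sketch of the proposed component: mask each extracted entity,
    ask BERT for replacements, and keep plausible candidates."""

    def __init__(self, unmasker: Callable, min_score: float = 0.05):
        self.unmasker = unmasker
        self.min_score = min_score  # threshold is a guess; needs tuning

    def suggest(self, text: str, entities: List[Dict]) -> Dict[str, List[str]]:
        """Return {entity value -> candidate synonyms} for one message.

        `entities` uses the DIET output shape: dicts with 'start',
        'end' and 'value' keys.
        """
        suggestions = {}
        for ent in entities:
            # Mask exactly one entity per sentence so BERT predicts it.
            masked = text[:ent["start"]] + "[MASK]" + text[ent["end"]:]
            predictions = self.unmasker(masked)
            suggestions[ent["value"]] = [
                p["token_str"].strip()
                for p in predictions
                if p["score"] >= self.min_score
                and p["token_str"].strip() != ent["value"]
            ]
        return suggestions
```

The surviving candidates could then be reviewed (or auto-appended, if we trust the threshold) into the relevant synonyms.yml file.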

Now then, thank you for taking the time to read this topic. Please, by all means, make suggestions or offer tips which could help us with this task, and correct anything we got wrong here. Our thought process is not very refined at the moment, so we are very much open to any changes and improvements.

Maybe fine-tuning the BERT model on our domain data would make this whole synonym-discovery process focus more on predicting domain-specific words. We haven't tried fine-tuning yet, as it requires substantial compute (e.g. a GPU or TPU) for training. What do you think about this?

Thanks again!