How to add lots of new words to the pre-trained data

Hello,

I am very new to Rasa and NLP. I have spent time reading the documentation and watching videos, and have finally gathered the courage to get my hands dirty with chatbot creation. I am trying to make a chatbot for a medical shop, which will provide information about medicines and suggest doses etc. based on a person’s profile.

I have a basic chatbot ready, but the issue is that it does not recognize medicine names beyond the ones I have trained it on. I need to tell the chatbot about all the medicines, which is a big list. I can generate a list of medicines, or I can collect articles about medicines from the net, but I am not sure how to train the chatbot on this list. At present my pipeline looks like this (the same as the one created by default):

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

As messages from users will be in English, I want to continue using the pre-trained data, but I wish to extend it with my domain-specific words. Is there any way to do this? I thought of using a lookup table, but I am not sure it is a good idea, given that there can be thousands of medicines and people can write the words of a name in a different order.

Please guide me. Thanks, Abhishek

Hi!

I’m not an expert, but for this I use a lookup table. I have already built one from a list of French first names from 2019, so several thousand entries, and it works well for detecting the entity when combined with some intent training examples.
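In case it helps, a lookup table with thousands of entries can be generated from a plain list rather than typed by hand. Below is a minimal sketch, assuming the Rasa YAML training data format with a `lookup` key. The function name `build_lookup_yaml`, the entity name `medicine`, and the `version` default are illustrative assumptions, not taken from the posts above.

```python
def build_lookup_yaml(entity, names, version="2.0"):
    """Render a lookup-table section of Rasa NLU training data as YAML text.

    `version` is an assumption; set it to whatever your other data files use.
    """
    seen = set()
    unique = []
    for name in names:
        # Collapse repeated whitespace and skip case-insensitive duplicates.
        cleaned = " ".join(name.split())
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)

    lines = [
        f'version: "{version}"',
        "nlu:",
        f"- lookup: {entity}",
        "  examples: |",
    ]
    lines += [f"    - {name}" for name in unique]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical medicine names, for illustration only.
    print(build_lookup_yaml("medicine", ["Paracetamol", "paracetamol ", "Folic  Acid"]))
```

The resulting text could be written to a file in the training data folder. As far as I know, the lookup table only takes effect through a component such as the RegexFeaturizer already present in the pipeline above, and the NLU data still needs some training examples where the entity is annotated.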

Thanks, I will try that. Did you run into any issues with lookup table entries that consist of multiple words? I suspect there are many such cases in my lookup table.

For large lists, check Rasa NLU Examples.

FlashText is generally preferable to regex for lookup tables with a large number of entries. This article explains why:

https://alibaba-cloud.medium.com/why-you-should-use-flashtext-instead-of-regex-for-data-analysis-960a0dc96c6a

In the Rasa NLU Examples repo, FlashText is used for lookups over large lists, using exact matches. I would imagine that medicine names are fairly unique most of the time.
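To illustrate the idea of exact matching over a large list, including entries made of several words, here is a small sketch in plain Python. It is not FlashText's API and not the Rasa NLU Examples component, only a token trie with longest-match lookup in the same spirit. The names `tokenize`, `build_trie`, and `extract_phrases` and the sample medicine names are made up for this example.

```python
import re


def tokenize(text):
    # Lowercase and keep only alphanumeric runs, so punctuation is ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_trie(phrases):
    """Build a token-level trie; the key None marks the end of a phrase."""
    trie = {}
    for phrase in phrases:
        node = trie
        for token in tokenize(phrase):
            node = node.setdefault(token, {})
        node[None] = phrase
    return trie


def extract_phrases(text, trie):
    """Return lookup entries found in text, preferring the longest match."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        node = trie
        match, match_end = None, i
        j = i
        # Walk the trie as far as the following tokens allow.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                match, match_end = node[None], j
        if match is not None:
            found.append(match)
            i = match_end  # continue after the matched phrase
        else:
            i += 1
    return found


if __name__ == "__main__":
    trie = build_trie(["Folic Acid", "Folic Acid Plus", "Paracetamol"])
    print(extract_phrases("Do you have folic acid plus or paracetamol?", trie))
```

Because the lookup walks the text once and follows the trie token by token, the cost grows with the length of the message rather than with the number of entries, which is the main argument made for this style of matching over one large regex. It only finds exact matches, so misspellings and reordered words are missed.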

Thank you, I will look at this option.

Hello,

I tried the regex approach. The issue is that the extracted entities do not always contain all the words of the matching lookup entry. I wanted a way to extract the whole lookup entry. From my understanding, I needed some featurization of the whole string, and to use that to find the best match for the user’s input.

I followed the steps described here: https://towardsdatascience.com/give-some-semantic-love-to-your-keyword-search-c35f16df2ee I created a file ‘action_helper.py’ containing a class LookupExtractor, which builds a model to extract the full text of an entry from the lookup table. In my Rasa chatbot, I fill the slots using a form, in which I ask the user for the various items.

My approach: when the bot receives a sentence from the user in answer to a form question, ValidateForm is called. In ValidateForm, I initialize the LookupExtractor, which creates the embedding vector for each lookup entry based on the model en_core_web_sm.

When ValidateForm is called for the lookup item, I pass the whole sentence to the LookupExtractor, and it returns the full text of the best-matching lookup entry. I set the extracted value to the slot “medName”.

This solution works, but I believe it is not optimal:
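For readers who want to try the "best match from the whole sentence" idea without a language model, here is a rough sketch. It swaps the spaCy vectors described above for character trigram overlap, so it is a different and much simpler technique than the one in the linked article. The function names, the `min_score` threshold, and the sample entries are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased, whitespace-normalized text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def containment(entry_grams, sentence_grams):
    """Fraction of the entry's n-grams that also occur in the sentence."""
    total = sum(entry_grams.values())
    if total == 0:
        return 0.0
    shared = sum((entry_grams & sentence_grams).values())
    return shared / total


def best_lookup_match(sentence, lookup, min_score=0.5):
    """Return the lookup entry best covered by the sentence, or None."""
    sentence_grams = char_ngrams(sentence)
    best_entry, best_key = None, (0.0, 0)
    for entry in lookup:
        score = containment(char_ngrams(entry), sentence_grams)
        # Break ties in favour of the longer entry, so a fully matched
        # "Folic Acid Plus" wins over the shorter "Folic Acid".
        key = (score, len(entry))
        if score >= min_score and key > best_key:
            best_entry, best_key = entry, key
    return best_entry


if __name__ == "__main__":
    lookup = ["Folic Acid", "Folic Acid Plus", "Paracetamol 500"]
    print(best_lookup_match("I need paracetmol 500 please", lookup))
```

Scoring by how much of each entry is covered, rather than by similarity to the whole sentence, keeps the score from shrinking when the user writes a long message around the medicine name. With thousands of entries the n-gram counts for the lookup would be worth computing once up front instead of on every call.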

  • In the LookupExtractor initialization, I load en_core_web_sm using spaCy; the same is done by the Rasa pipeline (so the model is loaded twice).

  • When ValidateForm is called, the slot medName has already been filled by NLU. Then, after the lookup extraction, I fill the slot again using `return {"med_name": med_name}`. When running Rasa in interactive mode, I see that the slot has two values: the original one extracted by the NLU processor, and the value I set after the lookup extraction. I think this is not correct, because the older value will probably also influence the conversation.

Is there any way to remove the older value of the slot medName before setting the new one?

With my very minimal knowledge of Rasa, I think I need to write a custom featurizer and a custom classifier. Please correct me if I am wrong! If that is true, please point me to a simple example or tutorial for this. I could not find one that is easy enough to understand and use as a basis for my own.

Thanks, Abhishek