How to add lots of new words to the pre-trained data

ashek1520 · June 3, 2021, 9:00pm

Hello,

I am very new to Rasa and NLP. After spending time reading the documentation and watching videos, I have finally gathered the courage to get my hands dirty with chatbot creation. I am trying to make a chatbot for a medical shop, which will provide information about medicines and suggest doses etc. based on a person’s profile.

I have a basic chatbot ready, but the issue is that it does not recognize medicine names beyond the ones I have trained it on. I need to tell the chatbot about all the medicines, which is a big list. I can generate a list of medicines, or I can collect articles about medicines from the net, but I am not sure how to train the chatbot on this list. At present my pipeline looks like this (the same as the one created by default):

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

As messages from users will be in English, I want to continue using the pre-trained data, but I wish to extend it with my domain-specific words. Is there a way to do this? I thought of using a lookup table, but I am not sure whether that is a good idea, given that there can be thousands of medicines and people can write the words of a name in a different order.

Please guide me. Thanks, Abhishek

ishibu · June 4, 2021, 5:43am

Hi!

I’m not an expert, but I use a lookup table for this. I have already built one from a “French firstnames list of 2019”, so a few thousand examples, and it works well for detecting the entity when combined with some intent training examples.

ashek1520 · June 4, 2021, 10:45am

Thanks, I will try that. Did you experience any issues with entries in the lookup table that consist of multiple words?

I am asking because I suspect there are many such cases in my lookup table.

souvikg10 · June 4, 2021, 11:41am

For large lists, check Rasa NLU Examples.

FlashText is generally preferable to regex for lookup tables with a large number of entries. You can also check this article about why:

https://alibaba-cloud.medium.com/why-you-should-use-flashtext-instead-of-regex-for-data-analysis-960a0dc96c6a

In the Rasa NLU Examples repo, FlashText is used for lookups over large lists with exact matching. I would imagine that names of medicines are pretty unique most of the time.
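To illustrate how exact matching over a large list can still handle multi-word entries, here is a minimal pure-Python sketch of word-level trie matching, in the spirit of the dictionary lookup that FlashText relies on. It is not the FlashText library and not the Rasa NLU Examples component; build_trie, extract_keywords and the sample medicine names are hypothetical and only meant to show the idea.

```python
from typing import Any, Dict, List

# Sentinel key marking the end of a complete lookup entry in the trie.
_END = object()


def build_trie(phrases: List[str]) -> Dict[Any, Any]:
    """Build a case-insensitive, word-level trie from lookup entries."""
    root: Dict[Any, Any] = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[_END] = phrase  # store the original spelling of the entry
    return root


def extract_keywords(text: str, trie: Dict[Any, Any]) -> List[str]:
    """Return lookup entries found in text, preferring the longest match."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    found: List[str] = []
    i = 0
    while i < len(words):
        node = trie
        match, match_end = None, i
        j = i
        # Walk the trie as long as consecutive words keep matching.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append(match)
            i = match_end  # continue after the matched entry
        else:
            i += 1
    return found


medicines = ["Paracetamol", "Vitamin D3", "Vitamin D3 Forte"]
trie = build_trie(medicines)
print(extract_keywords("Do you have vitamin d3 forte or paracetamol?", trie))
```

Each word of the message is visited roughly once, so the cost depends on the length of the message and not on the number of entries, which is the usual argument for this approach over one large regex. It only finds exact (case-insensitive) matches, so misspellings or reordered words would still be missed.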

ashek1520 · June 7, 2021, 8:54pm

Thank you, I will look at this option.

ashek1520 · June 22, 2021, 3:17pm

Hello,

I tried the regex approach. The issue is that the extracted entity does not always contain all the words of the matching lookup entry, and I wanted a way to extract the whole entry. From my understanding, I needed a way to featurize each whole lookup string and use that to find the best match in the user’s input.

I followed the steps described here: https://towardsdatascience.com/give-some-semantic-love-to-your-keyword-search-c35f16df2ee. I created a file ‘action_helper.py’ containing a class LookupExtractor, which builds a model to extract the full text of an entry from the lookup table. In my Rasa chatbot, I fill the slots using a form, in which I ask the user for the various items.

My approach: when the bot receives a sentence from the user in answer to a form question, ValidateForm is called. In ValidateForm, I initialize the LookupExtractor, which creates an embedding vector for each lookup entry based on the model en_core_web_sm.

When ValidateForm is called for the lookup item, I pass the whole sentence to the LookupExtractor, and it returns the full text of the matching lookup entry. I set the extracted value in the slot “medName”.
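The matching step described above can be sketched roughly as follows. This is not the LookupExtractor from action_helper.py: it swaps the spaCy embeddings for plain string similarity from Python's standard difflib module so that it runs without any model, and best_lookup_match, the threshold of 0.8 and the sample entries are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def best_lookup_match(
    sentence: str, entries: List[str], threshold: float = 0.8
) -> Optional[str]:
    """Return the lookup entry that best matches some span of the sentence.

    Every entry is compared against each window of the sentence that has
    the same number of words as the entry. The full entry text is returned
    if the best similarity reaches the threshold, otherwise None.
    """
    words = sentence.lower().split()
    best_entry: Optional[str] = None
    best_score = 0.0
    for entry in entries:
        target = entry.lower()
        size = len(target.split())
        for start in range(max(len(words) - size, 0) + 1):
            window = " ".join(words[start:start + size])
            score = SequenceMatcher(None, window, target).ratio()
            if score > best_score:
                best_entry, best_score = entry, score
    return best_entry if best_score >= threshold else None


lookup = ["Amoxicillin 500 mg", "Paracetamol", "Ibuprofen"]
# The misspelled name still resolves to the complete lookup entry.
print(best_lookup_match("i want to buy amoxicilin 500 mg today", lookup))
```

This compares every entry with every window, so it scales with the size of the list times the length of the sentence. For thousands of medicines it may be worth narrowing the candidates first, for example with exact keyword matching.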

This solution works, but I believe it is not optimal:

In the LookupExtractor initialization, I load en_core_web_sm using spaCy, and the same is done by the Rasa pipeline (so the model is loaded twice).
When ValidateForm is called, the slot medName has already been filled by NLU. Then, using the lookup extraction, I fill the slot again with `return {"med_name": med_name}`. When running Rasa in interactive mode, I see that the slot has two values: the original one extracted by the NLU pipeline, and then the value I set after the lookup extraction. I think this is not correct, because the older value will probably also influence the conversation.

Is there any way to remove the older value of the slot medName before setting the new one?
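As far as I understand the rasa_sdk form validation API, a validate_<slot_name> method of a FormValidationAction returns a dict of slot values, and a slot only holds one value at a time, so the returned value replaces whatever NLU extracted; returning None for the slot clears it and the form asks again. The two values seen in interactive mode may simply be the two slot events in the conversation history, but that is an assumption. The sketch below shows only the dict-building logic as a plain function so that it runs without Rasa; validated_slot, the med_name key and the canonical_names mapping are hypothetical.

```python
from typing import Dict, Optional


def validated_slot(
    slot_value: Optional[str], canonical_names: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Build the dict a validate_med_name method could return.

    canonical_names maps a lower-cased medicine name to its full lookup
    entry. A known name is replaced by the full entry; anything else is
    set to None so that the previously extracted value does not survive.
    """
    key = (slot_value or "").strip().lower()
    return {"med_name": canonical_names.get(key)}


names = {"vitamin d3 forte": "Vitamin D3 Forte"}
print(validated_slot(" vitamin D3 forte ", names))
print(validated_slot("aspirin", names))
```

Inside a real validation method the lookup would be whatever the LookupExtractor returns; the point is only that the method's return value decides the final slot value, so no separate step to delete the old value should be needed.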

With my very limited knowledge of Rasa, I think I need to write a custom featurizer and a custom classifier. Please correct me if I am wrong! If that is true, please point me to a simple example or tutorial for this. I could not find one that is easy enough to understand and use as a basis for my own.

Thanks, Abhishek
