How to add lots of new words to the pre-trained data

Hello,

I am very new to Rasa and NLP. I have spent time reading the documentation and watching videos, and have finally gathered the courage to get my hands dirty with chatbot creation. I am trying to build a chatbot for a medical shop that will provide information about medicines and suggest doses etc. based on a person’s profile.

I have a basic chatbot ready, but the issue is that it does not recognize medicine names beyond the ones I have trained it on. I need to tell the chatbot about all the medicines, which is a big list. I can generate a list of medicines, or I can collect articles about medicines from the net, but I am not sure how to train the chatbot on this list. At present my pipeline looks like this (the same as the one created by default):

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

As messages from users will be in English, I want to continue using the pre-trained data, but I wish to extend it with my domain-specific words. Is there any way to do this? I thought of using a lookup table, but I am not sure whether that is a good idea, given that there can be thousands of medicines and people can write the words of a name in a different order.

Please guide me. Thanks, Abhishek

Hi!

I’m not an expert, but for this I use a lookup table. I have already built one from a list of French first names from 2019, so a few thousand entries, and it works well for detecting the entity when combined with some intent training examples.
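
For a list of thousands of names, writing the lookup table by hand is impractical, so it may help to generate it. Below is a minimal sketch, assuming the Rasa 2.x/3.x YAML training data format (`version`, `nlu`, `lookup`, `examples`). The function name `build_lookup_yaml`, the entity name `medicine`, the sample names, and the `"2.0"` version string are all illustrative assumptions; adjust the version to match your installed Rasa.

```python
def build_lookup_yaml(entity: str, names: list[str], version: str = "2.0") -> str:
    """Render a lookup table as Rasa-style NLU YAML text (hypothetical helper)."""
    seen = set()
    unique = []
    for raw in names:
        # Collapse stray whitespace so "Vitamin  D3 " becomes "Vitamin D3".
        name = " ".join(raw.split())
        key = name.lower()
        # Skip blanks and case-insensitive duplicates, keeping first-seen order.
        if name and key not in seen:
            seen.add(key)
            unique.append(name)
    lines = [
        f'version: "{version}"',  # assumption: set this to your Rasa data version
        "nlu:",
        f"- lookup: {entity}",
        "  examples: |",
    ]
    lines += [f"    - {name}" for name in unique]
    return "\n".join(lines) + "\n"


# Illustrative input; in practice this would be read from your medicine list.
medicines = ["Paracetamol", "ibuprofen", "paracetamol ", "Amoxicillin  Clavulanate"]
yaml_text = build_lookup_yaml("medicine", medicines)
print(yaml_text)
```

The printed text can be saved as a file in the `data/` folder. As far as I know, a lookup table only has an effect when the pipeline contains a component that reads it, such as the `RegexFeaturizer` in the pipeline above, and the training data should still include a few annotated examples of the entity.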

Thanks, I will try that. Did you experience any issues with lookup table entries that contain multiple words? I suspect there are many such cases in my lookup table.

For large lists, check the Rasa NLU Examples repo.

FlashText is generally preferable to regex for lookup tables with a large number of entries. This article explains why:

https://alibaba-cloud.medium.com/why-you-should-use-flashtext-instead-of-regex-for-data-analysis-960a0dc96c6a

In the Rasa NLU Examples repo, FlashText is used for lookups against large lists, and it relies on exact matches. I would imagine that medicine names are pretty unique most of the time, I suppose.
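
To illustrate why this style of exact matching scales to large lists and handles multi-word entries, here is a rough pure-Python sketch of the underlying idea: a word-level trie scanned once over the message, preferring the longest match. This is not the FlashText library or the Rasa NLU Examples component; `build_trie`, `extract`, and the sample medicine names are hypothetical names used only for illustration.

```python
import re

TOKEN = re.compile(r"\w+")
END = None  # sentinel key marking a complete keyword; tokens are never None


def build_trie(keywords):
    """Build a word-level trie from keywords (matched case-insensitively)."""
    root = {}
    for keyword in keywords:
        words = TOKEN.findall(keyword.lower())
        if not words:
            continue  # ignore empty entries
        node = root
        for word in words:
            node = node.setdefault(word, {})
        node[END] = keyword  # remember the original spelling
    return root


def extract(text, trie):
    """Return (keyword, start, end) tuples, taking the longest match at each position."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]
    found = []
    i = 0
    while i < len(tokens):
        node = trie
        best = None
        j = i
        # Walk the trie as far as consecutive tokens allow.
        while j < len(tokens) and tokens[j][0] in node:
            node = node[tokens[j][0]]
            j += 1
            if END in node:
                best = (node[END], tokens[i][1], tokens[j - 1][2], j)
        if best:
            found.append(best[:3])
            i = best[3]  # resume after the matched keyword
        else:
            i += 1
    return found


trie = build_trie(["Paracetamol", "Vitamin D3", "Vitamin D3 Forte"])
text = "Can I take vitamin d3 forte with paracetamol?"
matches = extract(text, trie)
print(matches)
```

The scan cost depends on the length of the message rather than on the number of keywords, which is the main argument for this approach over one large regex. Note that it only finds words in the listed order, so reordered or misspelled names would still need synonyms or a fuzzy matcher.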

Thank you, I will look at this option.