Hi there,
My name is Sekhar, and I recently started exploring the Rasa stack, so I’m completely new to it. I have not yet installed the stack on my VM. I’m writing to this forum to understand whether the stack can help me with a specific requirement / use case involving Life Sciences data. I hope someone can shed light on my questions below:
Firstly, my requirement has nothing to do with a chatbot. But I think I can still deploy a chatbot and ask it to look at the data (documents) and classify them for me.
Here are the requirements:
1. I have 100K documents (all searchable PDFs) saved in a storage repository for training the Rasa stack.
2. There are 5 million production documents (all searchable PDFs) for document classification. These can be considered the real test documents for the model to classify. Each of these 5 million documents falls under one of two classes: (a) Submission or (b) Correspondence. In other words, the 5 million PDFs are a mix of ‘Submission’ and ‘Correspondence’ documents.
3. The 100K documents (see 1 above) have a few labeled data elements in an Excel file for each document…
4. We can use this 100K training data set (the labeled data elements) to train Rasa NLU. I can create an appropriate markdown file and a config file to pass to the training module.
5. The need is to classify the 5 million documents (see 2 above) using the trained model. That is, I ask the bot a question such as "Show me all the file names that can be classified as ‘Submission’". The bot then looks at the 5 million documents, runs the trained model, and displays the names of the files it classifies as ‘Submission’.
6. Finally, I click a button that copies the file names to an Excel file.
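To make the workflow in the list above concrete, here is a rough sketch in plain Python of the train, classify, and export steps as I picture them. It does not use Rasa at all: the tiny naive Bayes classifier, the sample texts, the file names, and the function names are all hypothetical stand-ins for illustration, and the export writes CSV (which Excel can open) rather than a native Excel file.

```python
import csv
import io
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def export_matches(model, documents, wanted_label, out_file):
    """Write the names of documents classified as wanted_label as CSV."""
    matches = [
        name for name, text in documents.items()
        if classify(model, text) == wanted_label
    ]
    writer = csv.writer(out_file)
    writer.writerow(["file_name"])
    writer.writerows([name] for name in matches)
    return matches


# Hypothetical stand-ins for the labeled training set (text already
# extracted from the PDFs, labels taken from the Excel file).
training = [
    ("new drug application submitted for regulatory approval", "Submission"),
    ("clinical study report submitted with the application dossier", "Submission"),
    ("letter to the agency regarding the meeting schedule", "Correspondence"),
    ("email reply regarding the request for a meeting", "Correspondence"),
]

# Hypothetical stand-ins for the production documents to classify.
production = {
    "doc_001.pdf": "application dossier for regulatory approval",
    "doc_002.pdf": "reply letter regarding the meeting",
    "doc_003.pdf": "study report submitted with the dossier",
}

model = train(training)
buffer = io.StringIO()  # a real run would open a .csv file instead
submission_files = export_matches(model, production, "Submission", buffer)
print(buffer.getvalue())
```

In this picture the bot would only be a front end that triggers something like `export_matches`; the classification itself is an ordinary batch job over the document text.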
Question: Is it possible to achieve the above requirements completely using only the Rasa stack? If yes, could you kindly describe the approach to solving this problem? Also, is Rasa really meant for such use cases out of the box?
Many thanks, Sekhar H.