Document Classification

Hi there -

My name is Sekhar - I just started exploring the Rasa stack recently, so I'm completely new to it. I also haven't installed the stack in my VM yet. I'm writing to this forum to understand whether the stack can help me with a specific requirement / use case involving Life Sciences data. I hope someone can shed light on the questions below:

Firstly, my requirement has nothing to do with a chatbot as such. But I think I could still deploy a chatbot and ask the bot to look at the data (documents) and classify them for me.

Here are the requirements:

  1. I have 100K documents (all searchable PDFs) saved in a storage repository for training the rasa stack.

  2. There are 5 million production documents (all searchable PDFs) to be classified. These can be considered the real test documents for the model. The classification intent for each of these 5 million documents will be either (a) Submission or (b) Correspondence. In other words, the 5 million PDFs are a mix of ‘Submission’ and ‘Correspondence’ documents.

  3. The 100K documents (refer 1 above) have a few labeled data elements in an Excel file for each document…

  4. We can use this 100K training set (labeled data elements) to train Rasa NLU. I can create an appropriate markdown file and a config file to pass to the training module.

  5. The need is to classify the 5 million documents (refer 2 above) using the trained model. That is, I would ask the bot a question like, "Show me all the file names that can be classified as ‘Submission’". The bot would then run the trained model over the 5 million documents and display the file names that qualify as ‘Submission’-related files.

  6. Finally, a button click should copy the matching file names to an Excel file.
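(Not from the original thread, but as a concrete illustration of requirements 3 and 4: turning per-document labels into the markdown file Rasa NLU expects could look roughly like the sketch below. All names here are hypothetical - `rows_to_nlu_markdown`, the `(filename, label)` pairs, and the `text_lookup` dict mapping filenames to already-extracted PDF text; the PDF text extraction itself is not shown.)

```python
from collections import defaultdict

def rows_to_nlu_markdown(rows, text_lookup):
    """Group document texts by label and emit Rasa NLU markdown,
    one '## intent:<label>' section per class with '- <text>' examples.

    rows: iterable of (filename, label) pairs, e.g. read from the Excel file
    text_lookup: dict mapping filename -> extracted document text
    """
    by_intent = defaultdict(list)
    for filename, label in rows:
        by_intent[label.lower()].append(text_lookup[filename])

    lines = []
    for intent, examples in sorted(by_intent.items()):
        lines.append(f"## intent:{intent}")
        for example in examples:
            lines.append(f"- {example}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)
```

The resulting string can be written to `nlu.md` and passed to training.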

Question: Is it possible to meet the above requirements completely using the Rasa stack alone? If yes, could you kindly describe an approach for solving this problem? And is Rasa really meant for such use cases out of the box?

Many thanks, Sekhar H.

Hi Sekhar,

Welcome to the Rasa forum! That’s a very interesting task. You can certainly apply Rasa to it once you have converted your data into the correct format, but success will depend a bit on the details. How long are the documents? How many categories do you have? Are there likely to be keywords that indicate one category over another?

I would recommend starting with a much smaller dataset (maybe 1k documents to train and 1k to test) just to check feasibility. I’d suggest a pipeline with the CountVectorsFeaturizer and the EmbeddingIntentClassifier. In the example below I’ve added a max_features value, which limits the vocabulary size (probably a good idea with so many documents).

pipeline:
- name: "CountVectorsFeaturizer"
  analyzer: "word"  # use "char" or "char_wb" for character n-grams
  token_pattern: "(?u)\\b\\w\\w+\\b"
  max_features: 2000
  lowercase: true
- name: "EmbeddingIntentClassifier"
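(An editorial aside, not part of the original reply: for steps 5 and 6 of the requirements, the classification pass over the production documents would run as a batch job rather than inside the conversation. A minimal sketch, assuming a `parse` callable that returns an intent dict shaped like the legacy `rasa_nlu` `Interpreter.parse()` output; `collect_by_intent` and `write_csv` are hypothetical helper names:)

```python
import csv

def collect_by_intent(parse, docs, wanted="submission"):
    """Run the trained model over (filename, text) pairs and keep the
    filenames whose predicted intent matches `wanted`.

    parse: callable returning {"intent": {"name": ..., "confidence": ...}, ...}
           (the shape produced by a trained NLU interpreter)
    docs:  iterable of (filename, text) pairs
    """
    hits = []
    for filename, text in docs:
        result = parse(text)
        if result["intent"]["name"] == wanted:
            hits.append(filename)
    return hits

def write_csv(filenames, path):
    """Dump the matched file names to a CSV file that Excel can open."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        for name in filenames:
            writer.writerow([name])
```

The bot’s custom action for "Show me all ‘Submission’ files" would then just call something like `collect_by_intent` and hand the result to `write_csv`.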

Let us know how it goes!