Entity extraction from data in PDF

kapilkathuria · September 19, 2018, 8:52am

Our requirement is to extract data from fund reports (.pdf files). We have 300+ variations of fund reports (from different fund houses), and hence we thought NLP might be a better way to go about it than RegEx. For example, we have a “Monthly Report - July 2018.pdf” file (snapshot at the bottom of this post). From this PDF file, we are interested in extracting the entities highlighted in red below:

We converted the PDF to a text file so that it can be used by RASA.

My queries are:

1. Should we train RASA NLU only on the relevant lines (i.e. lines which contain the entities we need), or on the data of the whole file? Or should RASA simply be trained on the complete data of one file as a single string?
2. Is RASA NLU the correct way to go about our problem, or are there other libraries / tools better suited to it (NLP-based, or maybe non-NLP, non-machine-learning-based)?
3. If RASA NLU is an appropriate approach, what tool can be used for data annotation?
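
To make the first question concrete, here is a minimal sketch of what line-level training data could look like, assuming the legacy Rasa NLU JSON training format (`rasa_nlu_data` / `common_examples` with character-offset entities). The intent name `fund_report_line`, the entity name `report_period`, and the helper `make_example` are hypothetical and only for illustration; the sample line is taken from the file name mentioned above, not from the actual report text.

```python
import json


def make_example(text, intent, spans):
    """Build one training example.

    `spans` maps a (hypothetical) entity name to the exact substring of
    `text` that should be labelled with it. Offsets are character-based,
    with `end` exclusive.
    """
    entities = []
    for name, value in spans.items():
        start = text.find(value)
        if start == -1:
            raise ValueError(f"{value!r} not found in {text!r}")
        entities.append(
            {"start": start, "end": start + len(value), "value": value, "entity": name}
        )
    return {"text": text, "intent": intent, "entities": entities}


# One entry per relevant line of the converted text file (assumed layout).
lines = [
    ("Monthly Report - July 2018", {"report_period": "July 2018"}),
]

training_data = {
    "rasa_nlu_data": {
        "common_examples": [
            make_example(text, "fund_report_line", spans) for text, spans in lines
        ]
    }
}

print(json.dumps(training_data, indent=2))
```

Annotating one line per example keeps the offsets easy to check by hand; whether that works better than feeding the whole file as a single string is exactly the open question here.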

For reference, here is a snapshot of the complete page of the PDF (I was not able to attach the PDF file due to a restriction).

rasa_newbie_123 · June 1, 2022, 4:16pm

@kapilkathuria hello there, did you find a solution for this issue? I also want to extract data from a PDF file, but I do not know where to start. For now I think this could work with custom actions, but I am really a Rasa beginner right now.
