Our requirement is to extract data from fund reports (.pdf files). We have 300+ variations of fund reports (from different fund houses), so we thought NLP might be a better approach than regular expressions. For example, we have the file “Monthly Report - July 2018.pdf” (snapshot at the bottom of this post). From this PDF file, we are interested in extracting the entities highlighted in red below:
We converted the PDF to a text file so that it can be used by RASA.
My queries are:
- Should we train RASA NLU only on the relevant lines (i.e. lines that contain the entities we need), or on the text of the whole file? Or should the complete contents of one file simply be given to RASA as a single string?
- Is RASA NLU the right way to approach our problem, or are there other libraries / tools better suited to it (NLP-based, or perhaps non-NLP and non-machine-learning)?
- If RASA NLU is an appropriate approach, what tool can be used for data annotation?
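
To make the first query concrete, here is a minimal sketch of what "training only on relevant lines" could look like: one text line becomes one training example in the legacy RASA NLU JSON layout (`rasa_nlu_data` / `common_examples`), with character offsets for each entity. The sample line, the entity names (`fund_name`, `nav`), the intent name and the helper `make_example` are all hypothetical and only for illustration; they are not taken from the actual report.

```python
import json


def make_example(line, values, intent="fund_report_line"):
    """Build one training example from a single text line.

    `values` maps a (hypothetical) entity name to the exact substring
    of `line` that should be labelled with it.
    """
    entities = []
    for entity, value in values.items():
        start = line.find(value)
        if start == -1:
            continue  # value does not occur in this line, so skip it
        entities.append({
            "start": start,
            "end": start + len(value),  # end offset is exclusive
            "value": value,
            "entity": entity,
        })
    return {"text": line, "intent": intent, "entities": entities}


# Hypothetical line; the real report text is not reproduced here.
sample_line = "Fund Name: Example Growth Fund   NAV: 12.34"
example = make_example(
    sample_line,
    {"fund_name": "Example Growth Fund", "nav": "12.34"},
)

training_data = {"rasa_nlu_data": {"common_examples": [example]}}
print(json.dumps(training_data, indent=2))
```

The alternative in the same query (the whole file as one string) would mean a single `text` value containing the full page, with offsets counted from the start of that string.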
For reference, here is a snapshot of the complete page of the PDF (the PDF file itself could not be attached due to a restriction):