Hi, I need some help with entity extraction. I want to extract the ‘living area’ from an input text. The input text can contain multiple sentences, some of them very long. The text can also contain similar-looking entities such as floor space or plot area. The text is in German. Here is an example:
“a) massivem Einfamilienhaus mit Satteldach, bestehend aus KG, EG, OG und nicht ausgebautem DG; BJ vermutl. 1968; Wfl. ca. 116 qm, Nfl. ca. 95 qm b) einfache erdgeschossige Nebengebäude/Unterstände als ehem. Kleintierställe c) Lage: in der Nähe der westlich verlaufenden Staatsstraße ST 2208 d) es besteht ein Instandhaltungsrückstau und es bestehen Bauschäden /-mängel;”
In the case above, the entity of interest is 116 (the value after ‘Wfl.’) and not 95 (the value after ‘Nfl.’). I have already tried training with the Rasa NLU stack, and the performance has been mixed. Often the entity is not extracted at all, and sometimes the model extracts entities that look similar but are false positives (such as the 95 in the example above). My question is: what should my approach be here? It is important to note that the text usually cannot be split into sentences properly, because it contains many abbreviations that end with a full stop, e.g. ‘ca.’ and ‘rd.’. And of course, as mentioned above, the sentences can be very long.
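To make the target behaviour concrete, here is a minimal rule-based sketch of the extraction described above, which could also serve as a baseline to compare a trained model against. It is only an illustration under assumptions: the cue list (‘Wfl.’, ‘Wohnfläche’), the unit list, the pattern itself, and the function name `extract_living_area` are hypothetical and not part of the Rasa setup. It works on the raw string, so it does not depend on sentence splitting.

```python
import re

# Assumed pattern: a living-area cue, an optional filler abbreviation
# such as "ca." or "rd.", then a number and a square-metre unit.
LIVING_AREA = re.compile(
    r"\b(?:Wfl\.|Wohnfläche)\s*:?\s*(?:(?:ca|rd)\.\s*)?"
    r"(\d+(?:,\d+)?)\s*(?:qm|m²|m2)",
    re.IGNORECASE,
)


def extract_living_area(text):
    """Return the first living-area value in square metres, or None."""
    match = LIVING_AREA.search(text)
    if match is None:
        return None
    # German decimals use a comma; convert before parsing.
    return float(match.group(1).replace(",", "."))


example = "BJ vermutl. 1968; Wfl. ca. 116 qm, Nfl. ca. 95 qm"
print(extract_living_area(example))
```

Because the number is only accepted directly after a living-area cue, the value after ‘Nfl.’ is not picked up. The sketch does not handle thousands separators (e.g. ‘1.200 qm’) or cues it has not been told about, so it would miss any phrasing outside this short list.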
Any kind of input is much appreciated. Thanks a lot in advance.
PS: I used the Rasa NLU stack with spaCy for training.