Hi, I need some help with entity extraction. I want to extract the 'living area' from an input text. The input text can have multiple sentences, some of them really long. Moreover, the text can also contain similar-looking entities such as floor space and plot area. The text is in German. Here is an example:
“a) massivem Einfamilienhaus mit Satteldach, bestehend aus KG, EG, OG und nicht ausgebautem DG; BJ vermutl. 1968; Wfl. ca. **116** qm, Nfl. ca. *95* qm b) einfache erdgeschossige Nebengebäude/Unterstände als ehem. Kleintierställe c) Lage: in der Nähe der westlich verlaufenden Staatsstraße ST 2208 d) es besteht ein Instandhaltungsrückstau und es bestehen Bauschäden /-mängel;”
In the case above, the entity of interest is 116 (highlighted in bold; 'Wfl.' stands for Wohnfläche, i.e. living area) and not 95 (in italics; 'Nfl.' stands for Nutzfläche, i.e. usable floor space). I have already tried training with the Rasa NLU stack and the performance has been mixed. A lot of the time the entity is not extracted at all, and sometimes it extracts similar-looking entities that are false positives (as in the example above). My question is: what should my approach be here? It is important to note that most of the time the text cannot be properly split into sentences, because it contains a lot of abbreviations ending with a full stop, e.g. 'ca.', 'rd.'. And of course, as mentioned above, the sentences can sometimes be very long.
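Just to illustrate the splitting problem: a naive period-based split immediately breaks on these abbreviations (toy sketch only, the example text is shortened):

```python
import re

text = ("massivem Einfamilienhaus mit Satteldach; BJ vermutl. 1968; "
        "Wfl. ca. 116 qm, Nfl. ca. 95 qm")

# Naive approach: split after every full stop followed by whitespace.
# This wrongly cuts the text at "vermutl." and "ca." as well.
naive_sentences = re.split(r"(?<=\.)\s+", text)
print(naive_sentences)
# ['massivem Einfamilienhaus mit Satteldach; BJ vermutl.', '1968; Wfl.',
#  'ca.', '116 qm, Nfl.', 'ca.', '95 qm']
```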
Any kind of input is more than appreciated.
Thanks a lot in advance
PS: I have used rasa-nlu with the spaCy stack for training.
Hi, first of all thanks a lot for your response. I have not tried duckling yet; will give it a shot. One question though: do I need to split my training data into individual sentences and then annotate them, or can I annotate a block of sentences and the model will internally split it into sentences, or is it going to take the whole block as one sentence? I sense that it is taking it as a whole block. The problem is that, as I said, there is not always a natural way to split my data into sentences; otherwise I would have done it myself in the preprocessing step.
Am I understanding correctly that you're looking for something that will always be formatted [Wfl. ca. $number]? If that's true and you use ner_crf in your pipeline (it's used by default in the sklearn pipeline), then given enough examples, ner_crf should be able to pick out the $number as the entity you're looking for.
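For reference, here is a rough sketch of how such examples could be annotated in the old rasa_nlu JSON training format. The entity name living_area and the intent name are made up, and the character offsets are computed with a regex so they don't get out of sync:

```python
import json
import re

# Hypothetical example texts - replace with your own data.
examples = [
    "Wfl. ca. 116 qm, Nfl. ca. 95 qm",
    "BJ 1972; Wfl. ca. 98 qm",
]

common_examples = []
for text in examples:
    match = re.search(r"Wfl\.\s*ca\.\s*(\d+)", text)
    if not match:
        continue
    start, end = match.span(1)           # offsets of the number only
    common_examples.append({
        "text": text,
        "intent": "objektbeschreibung",  # made-up intent name
        "entities": [{
            "start": start,
            "end": end,
            "value": match.group(1),
            "entity": "living_area",
        }],
    })

training_data = {"rasa_nlu_data": {"common_examples": common_examples}}
with open("nlu_training_data.json", "w") as f:
    json.dump(training_data, f, indent=2, ensure_ascii=False)
```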
Another thing you can try is splitting off the smaller part of the text that contains your entity and then running only that through Rasa NLU. In the example above, if you just submit the text in a) (and not the text in b) or c)), that should give Rasa NLU less noise and make the entity extraction better.
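Even a simple keyword-based pre-filter could already cut out most of the noise before the text hits Rasa NLU (sketch only, the keyword list and the chunking rule are just examples):

```python
import re

def filter_relevant_chunks(text, keywords=("Wfl.", "Nfl.", "qm")):
    """Keep only the chunks that mention one of the keywords."""
    # The listings seem to be separated by ';' or by markers like 'a)', 'b)'.
    chunks = re.split(r";|\s(?=[a-d]\)\s)", text)
    return [c.strip() for c in chunks if any(k in c for k in keywords)]

text = ("a) massivem Einfamilienhaus mit Satteldach; BJ vermutl. 1968; "
        "Wfl. ca. 116 qm, Nfl. ca. 95 qm b) einfache erdgeschossige "
        "Nebengebäude c) Lage: in der Nähe der Staatsstraße ST 2208")
print(filter_relevant_chunks(text))
# ['Wfl. ca. 116 qm, Nfl. ca. 95 qm']
```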
Thank you for your response. Unfortunately the data format is not that consistent, and I am already using ner_crf. I agree with your suggestion to split the data into smaller pieces, and that's what I am trying now. Unfortunately, as I said before, the data is somewhat unclean, i.e. there is no consistent way of splitting it. I am trying the nlp_spacy sentence tokenizer to split the data into smaller pieces, and I sense it might help. Let's see; I will keep you posted.
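Roughly what I am trying with spaCy (sketch; the model name and the abbreviation list will probably need tweaking for my data):

```python
import spacy
from spacy.attrs import ORTH

# German model; a smaller or larger variant may also work.
nlp = spacy.load("de_core_news_sm")

# Tell the tokenizer to keep abbreviations like "ca." as a single token,
# so the trailing full stop is less likely to be treated as a sentence end.
for abbrev in ("ca.", "rd.", "vermutl.", "ehem.", "Wfl.", "Nfl."):
    nlp.tokenizer.add_special_case(abbrev, [{ORTH: abbrev}])

text = ("massivem Einfamilienhaus mit Satteldach; BJ vermutl. 1968; "
        "Wfl. ca. 116 qm, Nfl. ca. 95 qm")

doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
```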
I have added a preprocessing step which is able to filter the noise out, so overall the entity extraction is working better now. But I am facing another issue. Since the entities I want to extract, e.g. living area and plot area, have the same structure, i.e. both are areas and both are typically followed by square metres, it is becoming very difficult for ner_crf to differentiate between them. Is it because I have more examples of one particular entity than the other? Or is it for some other reason? And most importantly, how can I fix this issue? Thanks a lot in advance.
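To check whether it is simply an imbalance problem, I am counting how often each entity label occurs in my training data (sketch, assuming the old rasa_nlu JSON format and a made-up file name):

```python
import json
from collections import Counter

# Hypothetical file name - my actual training file follows the old
# rasa_nlu JSON layout: rasa_nlu_data -> common_examples -> entities.
with open("nlu_training_data.json") as f:
    data = json.load(f)

label_counts = Counter()
for example in data["rasa_nlu_data"]["common_examples"]:
    for entity in example.get("entities", []):
        label_counts[entity["entity"]] += 1

print(label_counts)
# e.g. Counter({'living_area': 240, 'plot_area': 35}) would point to imbalance
```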
I think if you give ner_crf enough examples of Wfl. ca. 116 qm and Nfl. ca. 95 qm, and annotate the numbers with different entity labels, it should be able to pick them up as separate entities. You can even do something as simple as pasting the same examples in over and over again. I think that should work.
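If it does turn out to be an imbalance issue, a crude way to do the "paste the same examples over and over" idea programmatically might look like this (sketch only; entity and file names are made up):

```python
import json
import random

with open("nlu_training_data.json") as f:
    data = json.load(f)

examples = data["rasa_nlu_data"]["common_examples"]

# Oversample the examples that contain the underrepresented entity
# (here assumed to be 'plot_area') until it roughly matches the rest.
minority = [ex for ex in examples
            if any(e["entity"] == "plot_area" for e in ex.get("entities", []))]
majority = [ex for ex in examples if ex not in minority]

while minority and len(minority) < len(majority):
    minority.append(random.choice(minority))

data["rasa_nlu_data"]["common_examples"] = majority + minority
with open("nlu_training_data_balanced.json", "w") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
```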