Given a sentence such as `"am i borrowing [the great gatsby]{book_title} by [F. Scott Fitzgerald]{PERSON}?"`, I want to extract two entities from it: `book_title` and `PERSON` (the book’s author). So far I have tried `DIETClassifier` for the book title (it is also used for extracting other entities) and `SpacyEntityExtractor` with the `PERSON` dimension for the author’s name.
However, I have two problems (and therefore two questions) when combining these entity extractors:
- `DIETClassifier` extracts `book_title` incorrectly. It splits one title into several, tags the author’s name as a book title, or fails to recognize the title at all. For example, when I type `"am i borrowing the devil's notebook by anton szandor lavey?"`, Rasa outputs:
entities '[{'entity': 'book_title', 'start': 15, 'end': 26, 'confidence_entity': 0.7182283997535706, 'value': "the devil's", 'extractor': 'DIETClassifier'}, {'entity': 'book_title', 'start': 45, 'end': 58, 'confidence_entity': 0.9045381546020508, 'value': 'szandor lavey', 'extractor': 'DIETClassifier'}, {'entity': 'PERSON', 'value': 'anton szandor lavey', 'start': 39, 'confidence': None, 'end': 58, 'extractor': 'SpacyEntityExtractor'}]
What would be a better way to extract `book_title` in my case?
Would a lookup table be a good idea for both the book title and the author’s name? How large can a lookup table be?
- `SpacyEntityExtractor` does well at extracting people’s names. For example, `"am i borrowing the book called Harry Potter written by JK Rowling?"` gave me:
'[{'entity': 'book_title', 'start': 15, 'end': 18, 'confidence_entity': 0.7037492990493774, 'value': 'the', 'extractor': 'DIETClassifier'}, {'entity': 'book_title', 'start': 31, 'end': 43, 'confidence_entity': 0.9985925555229187, 'value': 'Harry Potter', 'extractor': 'DIETClassifier'}, {'entity': 'PERSON', 'value': 'Harry Potter', 'start': 31, 'confidence': None, 'end': 43, 'extractor': 'SpacyEntityExtractor'}, {'entity': 'PERSON', 'value': 'JK Rowling', 'start': 55, 'confidence': None, 'end': 65, 'extractor': 'SpacyEntityExtractor'}]'
However, a book title can be tricky because it may itself contain a person’s name (e.g. Harry Potter), which spaCy then also extracts as `PERSON`. How can I write logic to decide which `PERSON` entity should be kept as the author’s name?
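To make the question concrete, here is a minimal sketch of one possible rule: drop any `PERSON` entity whose character span lies entirely inside a `book_title` span. This is only an assumption about what such logic could look like, not something Rasa provides. The function name `drop_persons_inside_titles` is hypothetical, and the code works on plain entity dictionaries with the same keys as the output above, so it would still need to be wired in somewhere (for instance a custom action or a custom NLU component).

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def drop_persons_inside_titles(entities: List[Entity]) -> List[Entity]:
    """Remove PERSON entities whose span lies fully inside a book_title span.

    Hypothetical helper: it only compares the 'start'/'end' offsets that
    both extractors already report.
    """
    titles = [e for e in entities if e["entity"] == "book_title"]

    def inside_title(person: Entity) -> bool:
        return any(
            t["start"] <= person["start"] and person["end"] <= t["end"]
            for t in titles
        )

    return [
        e for e in entities
        if not (e["entity"] == "PERSON" and inside_title(e))
    ]


# Entities from the "Harry Potter" message above, trimmed to the relevant keys.
entities = [
    {"entity": "book_title", "start": 15, "end": 18, "value": "the"},
    {"entity": "book_title", "start": 31, "end": 43, "value": "Harry Potter"},
    {"entity": "PERSON", "start": 31, "end": 43, "value": "Harry Potter"},
    {"entity": "PERSON", "start": 55, "end": 65, "value": "JK Rowling"},
]

authors = [
    e["value"]
    for e in drop_persons_inside_titles(entities)
    if e["entity"] == "PERSON"
]
print(authors)  # ['JK Rowling']
```

The rule depends on `book_title` being extracted correctly in the first place. In the "devil's notebook" example the `PERSON` span (39 to 58) is not contained in either `book_title` span, so the author is kept, but the spurious `book_title` for "szandor lavey" would remain and need separate handling.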
Thank you for the help!