How to correctly extract a book title and an author name at the same time

Given a sentence such as "am i borrowing [the great gatsby]{book_title} by [F. Scott Fitzgerald]{PERSON}?", I want to extract two entities from it: book_title and PERSON (the book’s author). So far I have tried DIETClassifier for the book title (it is also used for extracting other entities) and SpacyEntityExtractor with the PERSON dimension for the author’s name.

However, I have two problems (and thus two questions) with the combined entity extractors:

  1. DIETClassifier extracts book_title incorrectly. It recognizes multiple titles, takes the author’s name for a book title, or fails to recognize the title at all. For example, when I type "am i borrowing the devil's notebook by anton szandor lavey?", Rasa sets the entities to

entities '[{'entity': 'book_title', 'start': 15, 'end': 26, 'confidence_entity': 0.7182283997535706, 'value': "the devil's", 'extractor': 'DIETClassifier'}, {'entity': 'book_title', 'start': 45, 'end': 58, 'confidence_entity': 0.9045381546020508, 'value': 'szandor lavey', 'extractor': 'DIETClassifier'}, {'entity': 'PERSON', 'value': 'anton szandor lavey', 'start': 39, 'confidence': None, 'end': 58, 'extractor': 'SpacyEntityExtractor'}].

What would be a better way to extract book_title in my case? Would a lookup table be a good idea for both the book title and the author’s name? How large can a lookup table be?

  2. SpacyEntityExtractor does well at extracting a person’s name. For example, "am i borrowing the book called Harry Potter written by JK Rowling?" gave me a result of

'[{'entity': 'book_title', 'start': 15, 'end': 18, 'confidence_entity': 0.7037492990493774, 'value': 'the', 'extractor': 'DIETClassifier'}, {'entity': 'book_title', 'start': 31, 'end': 43, 'confidence_entity': 0.9985925555229187, 'value': 'Harry Potter', 'extractor': 'DIETClassifier'}, {'entity': 'PERSON', 'value': 'Harry Potter', 'start': 31, 'confidence': None, 'end': 43, 'extractor': 'SpacyEntityExtractor'}, {'entity': 'PERSON', 'value': 'JK Rowling', 'start': 55, 'confidence': None, 'end': 65, 'extractor': 'SpacyEntityExtractor'}]'.

However, a book title can also be tricky because it may include a person’s name (e.g. Harry Potter), which spaCy would then also extract as PERSON. How can I write logic to decide which PERSON entity is the author name we actually want to keep?

Thank you for the help!

Hello @naamtokyam

In principle, DIET should be able to extract both book title and author, but it’ll need lots of training examples (I guess 1000+?). Book titles are especially difficult, because they can be quite long and variable. If this doesn’t work for you, could you make your bot ask for one thing at a time?

Hi @j.mosig Thank you! I will try with more examples. So would you recommend using DIET for the name extraction?

Actually, the user could also just enter a title without preamble or explaining what it is that they are typing. If the title is just a person’s name, how could DIET know if this is a title or an author?

You could use the RegexEntityExtractor in addition to DIET and train it with lookup tables that only include exact author names. But that’s tedious, because you have to update the Rasa model each time your database changes.
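
To make the lookup-table idea more concrete, here is a minimal Python sketch of exact matching against a list of known author names. It is not how RegexEntityExtractor is implemented, only an illustration of the same idea. KNOWN_AUTHORS, build_author_pattern, find_authors and the book_author label are all hypothetical names, and the returned dictionaries merely imitate the entity format shown earlier in this thread.

```python
import re

# Hypothetical lookup table; in practice this would come from your database.
KNOWN_AUTHORS = ["F. Scott Fitzgerald", "Anton Szandor LaVey", "JK Rowling"]


def build_author_pattern(names):
    """Build one case-insensitive regex that matches any listed name exactly."""
    # Longest names first, so a longer name wins over a shorter prefix.
    escaped = [re.escape(n) for n in sorted(names, key=len, reverse=True)]
    # (?<!\w) and (?!\w) act as word boundaries on both sides of the name.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def find_authors(text, pattern):
    """Return entity-like dicts for every lookup-table hit in the text."""
    return [
        {"entity": "book_author", "start": m.start(), "end": m.end(), "value": m.group(0)}
        for m in pattern.finditer(text)
    ]


pattern = build_author_pattern(KNOWN_AUTHORS)
hits = find_authors("am i borrowing the devil's notebook by anton szandor lavey?", pattern)
```

The drawback mentioned above shows up here as well: any author missing from KNOWN_AUTHORS is never found, so the list has to be kept in sync with the database.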

You could also have only one entity keyword with roles author, title, isbn, etc. and assign the role if it is clear or no role if it is not clear. Use this to train DIET and write a custom action that checks the database if it can find some matching entries. If you do this, note that we might tweak roles/groups in the future so that they are assigned by the model without looking at the entity value, so it’d only look at surrounding text to decide what the role is. This is useful in most cases, but you’d have to design your training data such that roles are only assigned whenever it can actually be inferred from surrounding words alone.
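
The database check could look roughly like the sketch below. It is plain Python so that it runs on its own; in a real custom action the same logic would sit inside the action's run method and read the entities from the tracker's latest message. CATALOG and resolve_keyword are hypothetical names, and the entity dictionaries assume a single entity called keyword with an optional role, as suggested above.

```python
# Hypothetical in-memory stand-in for the library database.
CATALOG = [
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald"},
    {"title": "Harry Potter", "author": "JK Rowling"},
]


def resolve_keyword(entity, catalog):
    """Decide whether a `keyword` entity is a title or an author.

    If the model assigned a role, trust it. Otherwise fall back to a
    database lookup and return every field the value matches.
    """
    role = entity.get("role")
    if role in ("title", "author"):
        return [role]
    value = entity["value"].strip().lower()
    matches = []
    for field in ("title", "author"):
        if any(book[field].lower() == value for book in catalog):
            matches.append(field)
    return matches  # empty list: unknown, so the bot should ask the user


with_role = resolve_keyword(
    {"entity": "keyword", "value": "JK Rowling", "role": "author"}, CATALOG
)
no_role = resolve_keyword({"entity": "keyword", "value": "harry potter"}, CATALOG)
unknown = resolve_keyword({"entity": "keyword", "value": "some unknown text"}, CATALOG)
```

If the lookup returns both title and author, or nothing at all, the bot can fall back to asking the user what they meant, which ties in with the earlier suggestion of asking for one thing at a time.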