How to deal with overlapping entities in training samples?

I am building a real estate bot and I am trying to extract entities at two levels at the same time: the apartment type and the number of specific rooms (bedrooms, bathrooms, etc.).

In our domain, people usually refer to apartments with phrases like:

  • I am looking for a 2 bed 2 bath apt
  • Looking for a 2x2 apt
  • I need a studio or a 1 bedroom
  • I need a 2b/2b
  • Looking for a room in a shared apartment

My idea was to extract three entities:

  • unit_type: studio, full apartment, shared apartment
  • bedrooms: 1, 2, 3, etc.
  • bathrooms: 1, 1.5, 2, etc.

However, in many cases these entities overlap, so I am not sure how to create the training samples, or whether Rasa can handle overlapping entities at all.

For example, the same sentence could be annotated as:

  • I am looking for a [2 bed 2 bath apt](unit_type:full apartment)
  • I am looking for a [2 bed](bedrooms:2) [2 bath](bathrooms:2) apt

Should I have two different training examples? Can I somehow merge them into one? And is there a better way of handling the numbers themselves?

Thanks a lot for your help! Nicolas

Hi Nicolas,

Personally, I don’t think having two training examples (the same sentence with different annotations) is a good idea. It seems like that would just confuse the bot.

As for the unit type, can you ask the user for that information in a separate question? For example:

User: I'm looking for a [2 bed](bedroom) [2 bath](bathroom) apt
Bot: I see, please tell me what type of apartment you want:
        - Studio
        - Full apartment
        - Shared apartment
User: A [studio](unit_type) please

In addition, you can add some logic in your actions, such as checking whether the bedroom and bathroom slots have both been filled. If they have, you know the user wants a full apartment, and the bot no longer needs to ask for it (if you are using FormAction, you can set the unit type slot yourself).
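The slot check described above could look roughly like the sketch below. It is plain Python rather than real Rasa SDK code: the function name infer_unit_type and the slot names bedrooms, bathrooms and unit_type are assumptions for illustration, and in a FormAction you would call something like this before deciding whether to ask for the unit type.

```python
from typing import Dict, Optional


def infer_unit_type(slots: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the unit type to store, or None if the bot should still ask.

    `slots` is a hypothetical mapping of slot name -> current value.
    """
    # Keep whatever the user already told us explicitly.
    if slots.get("unit_type"):
        return slots["unit_type"]
    # Both room counts filled: assume the user wants a full apartment.
    if slots.get("bedrooms") and slots.get("bathrooms"):
        return "full apartment"
    # Not enough information, so the bot should ask the question.
    return None
```

For example, infer_unit_type({"bedrooms": "2", "bathrooms": "2"}) gives "full apartment", while a request with only a bedroom count gives None and the bot asks as usual.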

You can train another CRFEntityExtractor and run it as a server.

Thanks for your input. In the end, I decided to split the entities differently. For anyone interested, this is what I did:

I'm looking for a [2](bedrooms_count) [bed](room:bedroom) [2](bathrooms_count) [bath](room:bathroom) [apt](unit_type:full)
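To see what this annotation amounts to, here is a small sketch that turns a [text](entity:value) annotated sentence into plain text plus character spans. It is a hand-written regex parser for illustration only, not Rasa's own training data reader, and the name parse_annotated is made up.

```python
import re
from typing import Dict, List, Tuple

# Matches [surface text](entity) or [surface text](entity:value).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated(annotated: str) -> Tuple[str, List[Dict]]:
    """Strip the annotations and return (plain_text, entities)."""
    parts: List[str] = []
    entities: List[Dict] = []
    cursor = 0  # position in the annotated string
    length = 0  # length of the plain text built so far
    for match in ANNOTATION.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        length += len(before)
        surface, label = match.group(1), match.group(2)
        entity, _, value = label.partition(":")
        entities.append({
            "start": length,
            "end": length + len(surface),
            "value": value or surface,  # no explicit value: use the text
            "entity": entity,
        })
        parts.append(surface)
        length += len(surface)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), entities
```

With this split, every token belongs to at most one entity, so no spans overlap and a single extractor can be trained on them.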

I still think it would be useful to be able to extract overlapping entities without having two different CRF Entity Extractors.

One thing I noticed is that the JSON format allows overlapping entities in the training data, but the Markdown format does not. For example, annotating overlapping entities in Rasa X produces correct JSON but incorrect Markdown, as the text is duplicated.
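As a rough illustration of why JSON can express this, the sketch below builds one training example whose entity spans overlap, using character offsets (the start, end, value and entity fields of the Rasa NLU JSON format). The intent name search_apartment and the helper functions are assumptions, and this only shows that the data can be written down, not that a given extractor will learn from it.

```python
import json

text = "I am looking for a 2 bed 2 bath apt"


def span(text: str, surface: str, entity: str, value: str) -> dict:
    """Build an entity dict from the first occurrence of `surface`."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface),
            "value": value, "entity": entity}


def overlaps(a: dict, b: dict) -> bool:
    """True if the two character spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


example = {
    "text": text,
    "intent": "search_apartment",  # hypothetical intent name
    "entities": [
        span(text, "2 bed 2 bath apt", "unit_type", "full apartment"),
        span(text, "2 bed", "bedrooms", "2"),
        span(text, "2 bath", "bathrooms", "2"),
    ],
}

# Serialised form, as it would appear inside a JSON training file.
as_json = json.dumps(example, indent=2)
```

Because each span carries its own offsets, the unit_type span can cover the same characters as the bedrooms and bathrooms spans, which inline Markdown brackets cannot express without repeating the text.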