Thanks for your help. The only issue I might have is that this is meant to interface with an ASR/TTS audio system, so I can’t use a lot of these functions, since I wouldn’t want the audio to play for too long. I also don’t want to be constantly asking the user for verification when they have already given enough information. Consider the following scenario in a fast food restaurant:
1 User: “Hi, I would like a large diet coke and a hamburger with extra cheese, extra pickles, and no lettuce”
2 Bot: “Hamburger with extra cheese, pickles, no lettuce and a large diet coke, anything else?”
3 User: “Yeah, I want large fries with my order with ketchup and special sauce.”
4 Bot: “Large fries with ketchup and special sauce, will that be all today?”
5 User: “Actually, I would like a milkshake instead of my coke and could you add a hashbrown on the side.”
6 Bot: “Okay, so I’ve removed the coke and added a milkshake and a hashbrown, what size and flavor would you like your milkshake to be?”
7 User: “How about a large chocolate milkshake”
8 Bot: “Okay, I’ve got a large chocolate milkshake. Is there anything else?”
9 User: “No, that’s it”
Notice how I don’t really have that much real estate for listing out all the stuff that my bot can do since it will be an audio system rather than a text system. I am trying to minimize the number of interactions that the user will need to have with the bot.
In this case, lines 1-4 are super simple: we just need some intent like “add_to_order” and a response from the bot that confirms the order. However, consider line 5. We would need the bot to understand that the user intends to swap the diet coke for a milkshake AND add a hashbrown to the order. For this, there are 2 design options that I am mulling over:
- Have a universal “edit_order” intent for any action that involves adding things to the order. This would cover anything that isn’t responding to a form, asking a question, greeting, etc. Here, we would need to use something like role classification, so the labeling would need to look like
Actually, I would like a [milkshake]{"entity": "item", "role": "new_swap_item"} instead of my [coke]{"entity": "item", "role": "old_swap_item"} and could you add a [hashbrown]{"entity": "item", "role": "new_item"} on the side
My worry is that the entity labeling won’t enforce the pairing of “old_swap_item” with “new_swap_item”, or that it will mislabel the “hashbrown” entity. We would also need enough training data to learn this, perhaps using lookup tables for cue words that might indicate a swap, like “actually”, “instead”, or “change”.
- Have multiple intents for actions that modify an order. Here, we would have “add_to_order” and “edit_order”. We might even split the intents up further to include something like “modifying_order” for requests like extra sauce on your burger, and “remove_order” for removing a side from your order. Here, though, it is hard for me to decide how the bot should interpret “Actually, I would like a milkshake instead of my coke”. Should it be modeled as removing the coke and adding a milkshake, or should the two items be explicitly paired up? Also, I would still have the issue of role classification, since knowing that the user wants to both add to their order and change something in their order doesn’t tell me which entities correspond to which action.
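Whichever option is used, the pairing concern can be handled after extraction rather than inside the model. Below is a minimal sketch of that idea, assuming the extractor returns a list of dicts with "value" and "role" keys (the role names are the ones from the labeling example above). The function name apply_edits and the flat list used for the order are hypothetical, just for illustration. A swap is only applied when every “old_swap_item” has a matching “new_swap_item” and the old item is actually in the order; otherwise the order is left alone and the bot is told to ask a follow-up question.

```python
def apply_edits(order, entities):
    """Return (new_order, needs_clarification) for one user utterance.

    `order` is a list of item names; `entities` is a list of dicts with
    "value" and "role" keys (an assumed shape, not a specific library API).
    """
    by_role = {"old_swap_item": [], "new_swap_item": [], "new_item": []}
    for ent in entities:
        role = ent.get("role")
        if role in by_role:
            by_role[role].append(ent["value"])

    olds, news = by_role["old_swap_item"], by_role["new_swap_item"]
    # An unpaired swap role means the labeling is incomplete: ask the user.
    if len(olds) != len(news):
        return list(order), True

    updated = list(order)
    # Pair swap items by order of appearance (a simplifying assumption).
    for old, new in zip(olds, news):
        if old not in updated:
            # The item to replace is not in the order: ask the user.
            return list(order), True
        updated[updated.index(old)] = new

    # Plain additions go at the end of the order.
    updated.extend(by_role["new_item"])
    return updated, False


# Line 5 of the dialogue, with the roles from the labeling example.
entities = [
    {"entity": "item", "role": "new_swap_item", "value": "milkshake"},
    {"entity": "item", "role": "old_swap_item", "value": "coke"},
    {"entity": "item", "role": "new_item", "value": "hashbrown"},
]
new_order, ask_user = apply_edits(["coke", "hamburger", "fries"], entities)
```

This treats the swap as an explicit pair, so the milkshake takes the coke’s place in the order. It does not solve mislabeled roles (a hashbrown tagged as “new_swap_item” would still trigger a clarification), and it matches item names exactly, so “coke” would not match “diet coke” without some normalization.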
So far, I have been getting pretty far by just having a hard coded rule-based system where I just identify entities and swaps and deletes with a giant dictionary of cue words that indicate what the user wants to do, but this won’t scale well and I do think that it kinda negates the point of using ML for intent classification and entity extraction.
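For reference, the rule-based approach boils down to something like the sketch below. This is a simplified illustration, not my actual code: the swap cues are the ones mentioned above, while the “remove” and “add” cue words and the name detect_actions are made up for the example.

```python
import re

# Cue words per action. The swap cues come from the post; the "remove"
# and "add" cues are illustrative assumptions.
CUE_WORDS = {
    "swap": {"actually", "instead", "change"},
    "remove": {"remove", "cancel"},
    "add": {"add"},
}


def detect_actions(utterance):
    """Return the set of actions whose cue words appear in the utterance."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    return {action for action, cues in CUE_WORDS.items() if tokens & cues}


actions = detect_actions(
    "Actually, I would like a milkshake instead of my coke "
    "and could you add a hashbrown on the side."
)
```

For line 5 this finds both a swap and an add, but it says nothing about which item belongs to which action, and every new phrasing needs another dictionary entry, which is why it won’t scale.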