Product names on ecommerce sites are typically very long, for example: “EYEBOGLER V-Neck Shawl Collar Stylish Men’s Solid T-Shirt”.
Users rarely type the entire product name in a conversation, and typos are to be expected.
Option 1: Use lots of training examples for the model to learn from. The problem is that the model might overfit to the programmatically generated examples.
Option 2: Use lookup tables that list all product names, matched by regex. The problem here is that once we factor in the ways a user might utter a product name (only some parts of the name, typos, etc.), the list can grow really big.
Which option is better, and is there any other way of solving this?
Breaking the product name down into different entities works for a particular category like t-shirts. But how do we make it scale to all sorts of product categories, for example smartphones, earphones, clothing, etc.?
I am looking for a general entity recognition solution for ecommerce product names across various categories. For example, consider the Flipkart products available in the public Flipkart Products dataset on Kaggle.
Based on the extracted values, perform a keyword search on the product table. If there are multiple product names in the resulting search, show them to the user and ask for confirmation.
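A minimal sketch of that search-then-confirm step, assuming the entity extractor has already produced a list of strings. The product table, the "Acme" entry, and the function names (tokenize, keyword_search, respond) are all hypothetical; a real system would query a database or search index instead of a Python list.

```python
import re

# Hypothetical product table; in practice this would be a database or search index.
PRODUCTS = [
    "EYEBOGLER V-Neck Shawl Collar Stylish Men's Solid T-Shirt",
    "EYEBOGLER Round Neck Men's Striped T-Shirt",
    "Acme Wireless Bluetooth Earphones",  # made-up entry for illustration
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(extracted_values, products=PRODUCTS):
    """Return the products whose names contain every extracted keyword."""
    wanted = set()
    for value in extracted_values:
        wanted |= tokenize(value)
    if not wanted:
        return []
    return [p for p in products if wanted <= tokenize(p)]


def respond(extracted_values):
    """Answer directly on a single hit, otherwise ask the user to pick."""
    matches = keyword_search(extracted_values)
    if not matches:
        return "Sorry, I could not find that product."
    if len(matches) == 1:
        return f"Found: {matches[0]}"
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(matches, 1))
    return f"Which one did you mean?\n{options}"


if __name__ == "__main__":
    print(respond(["eyebogler", "t-shirt"]))  # ambiguous: two shirts share these keywords
    print(respond(["shawl collar"]))          # only one product has both words
```

Requiring every keyword to be present keeps the result list short, but it is strict: a single misspelled keyword returns nothing, so this step probably needs to be paired with some typo handling.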
Both options have pros and cons, and the choice depends on the trade-offs you are willing to make and the constraints of your application. There are also alternative approaches worth considering. Let’s examine each option and then the alternatives:
Option 1: Use Lots of Training Examples
Pros:
Can handle a wide variety of user inputs and typos.
Can provide more natural and flexible responses.
Cons:
Prone to overfitting to the training data.
Requires continuous updating as new products are added.
Option 2: Use Lookup Tables with Regex
Pros:
Can efficiently handle known product names.
Relatively straightforward to implement.
Avoids overfitting, since no model is trained.
Cons:
Can struggle with variations, typos, and new products not in the lookup table.
Might lead to a large lookup table as you consider variations.
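To make Option 2 concrete, here is a rough sketch of a lookup table compiled into one regex. The table entries and the name find_products are assumptions for illustration. Note that the shortened variants have to be listed by hand, which is exactly how the table grows.

```python
import re

# Hypothetical lookup table: the full name plus a few hand-written short variants.
LOOKUP = [
    "EYEBOGLER V-Neck Shawl Collar Stylish Men's Solid T-Shirt",
    "V-Neck T-Shirt",
    "Shawl Collar T-Shirt",
]

# Longer names go first so the most specific alternative is tried first.
PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(LOOKUP, key=len, reverse=True)),
    re.IGNORECASE,
)


def find_products(utterance):
    """Return every lookup-table entry that appears verbatim in the utterance."""
    return [m.group(0) for m in PATTERN.finditer(utterance)]


if __name__ == "__main__":
    print(find_products("do you have the v-neck t-shirt in blue?"))  # exact variant: found
    print(find_products("do you have the v-nek tshirt?"))            # typo: nothing found
```

The second call shows the weakness listed above: a regex over literal names only matches what is spelled exactly as in the table.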
Alternative Approaches:
Fuzzy String Matching: Use fuzzy string matching to identify similar strings even when they contain typos. Libraries such as FuzzyWuzzy, or any implementation of Levenshtein distance, can help with this. This approach can recognize variations in user input while keeping the lookup table manageable.
Keyword Extraction and Entity Recognition: Implement natural language processing (NLP) techniques to extract keywords or named entities from the user input. This can help identify relevant product terms, even if they’re not an exact match.
Hybrid Approach: Combine the strengths of both options. Start with a lookup table for known product names and use a fuzzy matching algorithm to handle variations and typos. This can strike a balance between accuracy and flexibility.
User Feedback Mechanism: Implement a user feedback mechanism. When the chatbot suggests a product, allow users to confirm whether it’s the product they intended. Use this feedback to improve the system over time.
Machine Learning Models: Consider training machine learning models to recognize product names, especially if the product names change frequently or if you need to handle a large number of products. These models can learn from patterns in the data, but it’s important to manage overfitting.
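The hybrid approach described above (lookup table plus fuzzy matching, followed by user confirmation) could look roughly like this. It is only a sketch under several assumptions: it uses Python's standard-library difflib as a stand-in for FuzzyWuzzy, the catalog and the "Acme" entry are made up, and the 0.8 cutoff and the names tokens, INDEX, and candidates are illustrative choices, not tuned values. Matching is done per word rather than per full name, so partial names and typos are both tolerated without enumerating every variation.

```python
import difflib
import re

# Hypothetical catalog; a real one would be loaded from the product table.
CATALOG = [
    "EYEBOGLER V-Neck Shawl Collar Stylish Men's Solid T-Shirt",
    "EYEBOGLER Round Neck Men's Striped T-Shirt",
    "Acme Wireless Bluetooth Earphones",  # made-up entry for illustration
]


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Inverted index: token -> indices of the products whose name contains it.
INDEX = {}
for i, name in enumerate(CATALOG):
    for tok in tokens(name):
        INDEX.setdefault(tok, set()).add(i)
VOCAB = list(INDEX)


def candidates(utterance, cutoff=0.8):
    """Rank products by how many user words match a name word, exactly or fuzzily."""
    scores = {}
    for tok in tokens(utterance):
        if len(tok) < 3:
            continue  # very short tokens are too ambiguous to match fuzzily
        if tok in INDEX:
            matched = tok  # exact lookup-table hit
        else:
            close = difflib.get_close_matches(tok, VOCAB, n=1, cutoff=cutoff)
            if not close:
                continue  # no catalog word is similar enough
            matched = close[0]
        for i in INDEX[matched]:
            scores[i] = scores.get(i, 0) + 1
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return [CATALOG[i] for i in ranked]


if __name__ == "__main__":
    # The ranked list can be shown to the user for confirmation.
    for name in candidates("do you have eyeboglar shawl colar shirt"):
        print(name)
```

With this toy catalog, the misspelled query above ranks the shawl-collar shirt first and the other EYEBOGLER shirt second. On a real catalog, ordinary words in the utterance may also fuzzy-match catalog words, so a stop-word list or a stricter cutoff would likely be needed.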
Ultimately, the best approach might involve a combination of these methods. For example, you could start with a lookup table and gradually expand it using user interactions and fuzzy matching algorithms. Regularly collecting user feedback and monitoring system performance can help you fine-tune your approach over time.