I am using NLU to extract pickup and dropoff entities for a quoting system. Currently I am relying on the CRF extractor to get custom entities, for two reasons: 1.) the locations aren't in standard NLP packages afaik (they're Aussie suburbs and postcodes) and 2.) I don't want to assume people will properly capitalise place names. Consequently I found spaCy by itself doesn't perform that well. I also need to allow for typos (making exact-match database queries impossible, and fuzzy matching is slow over the whole DB).
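To illustrate the fuzzy-matching side of this, here's a minimal sketch using only the standard library's difflib; the suburb list and cutoff are made up for illustration, and over the full 16k list you'd want to pre-filter candidates (e.g. by first letter or postcode prefix) rather than scan everything:

```python
import difflib

# Hypothetical tiny stand-in for the 16k-row locations table
SUBURBS = ["dover heights", "bondi beach", "surry hills", "parramatta"]

def fuzzy_suburb(text, cutoff=0.8):
    """Return the closest known suburb to `text`, or None if nothing
    scores above `cutoff` (difflib's 0-1 similarity ratio)."""
    matches = difflib.get_close_matches(text.lower(), SUBURBS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_suburb("dover hieghts"))  # typo still resolves to "dover heights"
print(fuzzy_suburb("zzzzz"))          # no plausible match -> None
```

This is O(n) per query over the candidate list, which is exactly why it gets slow at 16k entries unless you narrow the candidates first.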
There are several hurdles I am trying to overcome and would appreciate input on.
1.) The database is 16k locations, some of which are multiword and contain proper English words (e.g. Dover Heights). Training NLU on 16k examples is not hard (though this assumes it sees each one once); however, the diversity of example sentences becomes a problem - I certainly cannot think of even 1000 different sentences to train on to avoid overfitting (though one could argue this means it's not a problem, because there probably aren't that many ways people will ask for things). I'm looking into Chatito (github.com/rodrigopivi/Chatito), which generates datasets for NLU/NER/text classification from a simple DSL. It seems super cool, and while I suspect it won't solve the problem entirely, I'm looking forward to playing with it.
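For anyone unfamiliar with Chatito, a definition file looks roughly like this (intent name, sentence templates, and slot values here are all made up; `%[...]` defines an intent, `@[...]` a slot, `~[...]` an alias, and `?` marks a token as optional):

```
%[quote]('training': '200')
    ~[hi?] quote on sending to @[dropoff_suburb]
    its going from @[pickup_suburb] to @[dropoff_suburb]

~[hi]
    hi
    hey

@[pickup_suburb]
    parramatta
    surry hills

@[dropoff_suburb]
    bondi
    dover heights
```

Chatito then expands the cross-product of templates and slot values into training examples, so in principle the 16k suburbs can be dropped into the slot definitions and combined with a much smaller set of hand-written sentence shapes.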
2.) Related to the above, there will also be examples which lack much context, for example "quote on sending to [dropoff_suburb]" or "it's going to [suburb]".
Also, sentences can vary quite a bit in levels of information, for example "Hi I want a quote on sending 3 items which are at [suburb1] [postcode1] over to [suburb2] [postcode2], they're XX kgs and YY dimensions".
Or it could have just one of the above entities and the rest still have to be extracted. Technically this isn't a hurdle as such, but I thought it worth mentioning.
So my broad question is: any tips? My more specific question is: what set of CRF features is best suited to this? As mentioned, I cannot assume capitalisation (and in fact, for standardisation, I am lowercasing everything, which may be foolish), so the 'title' and similar case features are probably a waste. I've been playing around with different feature sets, but I have yet to find one that consistently performs better than the others across a selection of examples.
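For concreteness, one feature set I'd start from looks like this (a sketch against the Rasa 1.x-era `CRFEntityExtractor`; component names and feature keys are from that API, but the particular combination is just my guess, not a recommendation from the docs). Since case is gone, it leans on sub-word shape (`prefix*`/`suffix*`), `digit` for postcodes, and `pattern` fed by a postcode regex via the `RegexFeaturizer`:

```yaml
language: en
pipeline:
  - name: "WhitespaceTokenizer"
  - name: "RegexFeaturizer"      # lets a 4-digit postcode regex feed the "pattern" feature
  - name: "CRFEntityExtractor"
    # case features ("title", "upper") dropped since input is lowercased
    features:
      - ["low", "suffix2", "digit"]                                        # previous token
      - ["bias", "low", "prefix2", "prefix5", "suffix2", "suffix3", "suffix5", "digit", "pattern"]  # current token
      - ["low", "suffix2", "digit"]                                        # next token
```

The three inner lists are the sliding window (previous / current / next token), which is what lets context words like "from" and "to" influence the label of the neighbouring suburb token.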
EDIT: As an aside, I know other people on here have talked about similar issues, i.e. training specialist context-specific NLUs, and that's basically what is already being done here. There is actually a more direct method to parse these sentences: wipe out all English words, and whatever is left (with some caveats) are addresses; then you just need to look at the position of the 'from' and 'to' words and you can extract the pickup and dropoff. The problem comes with the postcodes when you have an array of other numerical values, which is why I came back to using NLU, as the former approach seemed to be riddled with edge cases.
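That direct method can be sketched in a few lines (the English word list here is a toy stand-in for a real vocabulary, and this deliberately ignores the numeric-value edge cases mentioned above - a bare "3" or "XX kgs" would wrongly survive the filter just like a postcode does):

```python
import re

# Hypothetical stop list; in practice this would be a full English vocabulary
ENGLISH_WORDS = {"hi", "i", "want", "a", "quote", "on", "sending",
                 "its", "going", "items", "which", "are", "at", "over"}

def extract_route(text):
    """Rule-based sketch: drop known English words, then assign whatever
    remains to pickup/dropoff based on the last seen 'from'/'to' anchor."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    pickup, dropoff, target = [], [], None
    for tok in tokens:
        if tok == "from":
            target = pickup
        elif tok == "to":
            target = dropoff
        elif tok in ENGLISH_WORDS:
            continue
        elif target is not None:
            target.append(tok)
    return " ".join(pickup), " ".join(dropoff)

print(extract_route("quote on sending from dover heights to bondi"))
```

It works on clean input, but as soon as weights, item counts, and postcodes all appear as bare numbers, disambiguating them needs per-case rules, which is the edge-case explosion that pushed me back to NLU.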