I’m trying to do address extraction from free text and would like to understand better how the lookup table impacts the CRF so I can format things optimally.
In the simplest case, where each token is a single word, I understand how we can add a feature to the CRF input which just says ‘is this word in a list’. In that case it makes sense to tag the single word and to have that single word in the lookup table.
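To make the single-word case concrete, this is roughly how I picture it. It is only a minimal sketch of my mental model, not the library's actual code: `token_features` and the `in_lookup` feature name are names I made up for illustration.

```python
def token_features(tokens, lookup):
    """Build one feature dict per token, with a flag for lookup membership."""
    # Compare case-insensitively so "washington" and "Washington" both match.
    lookup_lower = {word.lower() for word in lookup}
    return [
        {"text": token, "in_lookup": token.lower() in lookup_lower}
        for token in tokens
    ]


tokens = "I live on Washington St".split()
features = token_features(tokens, {"Washington"})
flags = [f["in_lookup"] for f in features]
```

Here only the token ‘Washington’ gets the flag, and the CRF would see that flag as one extra input feature alongside whatever else it uses for that token.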
For multiword entities I’m unsure. For example, in the address example in this post, I might have ‘123 Washington St’ and ‘Washington St’ both as valid address entities, but which should be included in the lookup table? Is it sufficient to just include ‘Washington’? After all, it’s just an input feature flag, not a hard lookup, so the goal is just to tell the CRF that this part is more likely to be an address than not, and the CRF should propagate this probability to neighbouring tokens, right?
What I am getting at is that my data will contain ‘addresses’, but they won’t always contain the same number of words or the same types of items, e.g. some may contain postcodes and others may not. (I have tried having a separate entity for each component, but this gets squiffy pretty quickly.)
So I guess my question is: how should I think about the lookup featuriser working on multiword entities, and which approach is, on balance, the most generalisable? Should I include each permutation/combination explicitly in the lookup table, or simply break everything down into individual words and include only those?
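Here is a rough sketch of the two options as I imagine them. It assumes the lookup entries are turned into one regex and that every token overlapping a match gets flagged; that is my guess at the mechanism, not confirmed behaviour, and `lookup_flags` is a hypothetical helper of my own.

```python
import re


def lookup_flags(text, lookup_entries):
    """Flag every whitespace token that overlaps a lookup match in the text."""
    # Longest entries first, so the alternation prefers the longer phrase.
    entries = sorted(lookup_entries, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in entries) + r")\b",
        re.IGNORECASE,
    )
    spans = [m.span() for m in pattern.finditer(text)]

    flags = []
    for tok in re.finditer(r"\S+", text):
        # A token is flagged if its character span overlaps any match span.
        hit = any(tok.start() < end and start < tok.end() for start, end in spans)
        flags.append((tok.group(), hit))
    return flags


text = "Send it to 123 Washington St please"

# Option 1: whole phrases in the lookup table.
phrase_flags = lookup_flags(text, ["123 Washington St", "Washington St"])

# Option 2: individual words only.
word_flags = lookup_flags(text, ["Washington", "St"])
```

With whole phrases, ‘123’, ‘Washington’ and ‘St’ are all flagged; with individual words, only ‘Washington’ and ‘St’ are, so the house number has to be picked up by the CRF from context. The word-level table would also flag ‘Washington’ or ‘St’ anywhere else they appear, which is the trade-off I am unsure about.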
Thanks, Z
EDIT: As a follow-up question, is it correct that lookup tables work by adding a feature to the CRF via the regex featuriser?