Question about optimal lookup table usage

Zylatis · July 28, 2019, 1:49am

I’m trying to do address extraction from free text and would like to understand better how the lookup table impacts the CRF so I can format things optimally.

In the simplest case of each token being a single word, I understand how we can add a feature to the CRF input which just says ‘is this word in a list’. In that case it makes sense to tag the single word, and have that single word in the lookup table/list.

For multiword entities I’m unsure. For example, in the address example in this post, I might have ‘123 Washington St’ and ‘Washington St’ both as valid address entities, but which should be included in the lookup? Is it sufficient to just include ‘Washington’? After all, it’s just an input feature flag, not a hard lookup, so the goal is just to tell the CRF this part is more likely to be an address than not, and the CRF should propagate this probability to neighbouring tokens, right?

What I am getting at is that my data will contain ‘addresses’, but they won’t always contain the same number of words or types of items, i.e. some may contain postcodes, others may not (I have tried having entities for each component but this gets squiffy pretty quickly).

So I guess my question is: how should I think about the lookup featuriser working on multiword entities, and which approach is on balance most generalisable? Including each permutation/combination explicitly in the lookup table, or simply breaking everything down into individual words and including only those in the lookup?

Thanks, Z

EDIT: As a followup query, is it correct that the way the lookup tables work is they add a feature to the CRF by going through the regex featuriser?

Ghostvv · August 5, 2019, 11:43am

Yes, they add a feature using the regex featurizer, so for multiword entries the feature will be the same.

Zylatis · August 7, 2019, 2:33am

Thanks for the info @Ghostvv! Could you please clarify what you mean by ‘the same’, perhaps with a single and multiword example?

Thanks, Z

Ghostvv · August 7, 2019, 11:06am

Each regex example is featurized with one-hot encoding. I meant that both words in a multiword entity will have the same feature present.

Zylatis · August 7, 2019, 10:58pm

Oh okay, so essentially the lookup table will be tokenized into single words and the featuriser will have the ‘in_lookup’ = 1 for each of these words, right?

Ghostvv · August 8, 2019, 7:54am

Should be. The easiest way to verify it is to hack into the code and add a few print statements.
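
For anyone who wants to see the idea without digging into the library, here is a minimal sketch of the behaviour discussed in this thread. It is not Rasa's actual code. It assumes the lookup entries are joined into one case-insensitive regex with word boundaries, and that every token overlapping a match gets the same flag. The function name `lookup_flags`, the whitespace tokenization, and the sample sentence are all made up for illustration.

```python
import re


def lookup_flags(text, lookup_entries):
    """Return (token, in_lookup) pairs for whitespace-separated tokens.

    Hypothetical helper: mimics a lookup table that is turned into a
    single regex, then flags every token that overlaps a regex match.
    """
    tokens = list(re.finditer(r"\S+", text))
    if not lookup_entries:
        # An empty table can never match, so no token gets the flag.
        return [(tok.group(), 0) for tok in tokens]

    # One alternation over all entries, each wrapped in word boundaries.
    pattern = re.compile(
        "|".join(r"\b" + re.escape(entry) + r"\b" for entry in lookup_entries),
        re.IGNORECASE,
    )
    spans = [m.span() for m in pattern.finditer(text)]

    flags = []
    for tok in tokens:
        # A token is flagged if its character range overlaps any match.
        hit = any(tok.start() < end and tok.end() > start for start, end in spans)
        flags.append((tok.group(), int(hit)))
    return flags


text = "deliver to 123 Washington St please"

# Compare the lookup strategies raised in the question.
for entries in (
    ["Washington St"],
    ["Washington"],
    ["123 Washington St", "Washington St"],
):
    print(entries, lookup_flags(text, entries))
```

Under these assumptions, a multiword entry such as ‘Washington St’ flags both ‘Washington’ and ‘St’ with the same feature, while an entry of just ‘Washington’ flags only that one token and leaves it to the CRF to extend the entity to its neighbours. The entries themselves are matched as whole phrases; only the sentence is split into tokens.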
