Improving Extraction of Alphanumeric Entity

carla.lmeida · October 30, 2018, 3:15pm

Hi everyone,

I was wondering if there is anything I can do to improve the extraction of an alphanumeric entity. We are currently working with shipping container IDs, and although the intent is recognized, the slot comes back empty whenever the entity does not match one of the examples in the NLU training data. Is there anything I can do to improve that?

Additional information:

There are already examples in our NLU training data
Our pipeline includes entity_featurizer_regex

Example of what’s in the nlu.md:

## intent:inform_container
 - [NYKU3154360](NROCONTAINER)
 - [CAIU38699623](NROCONTAINER)
 - meu número é [YMLU5387323](NROCONTAINER)
 - meu número é [SEGU4074485](NROCONTAINER)
## regex:NROCONTAINER
 - [a-zA-Z]{3}[uU]{1}[0-9]{7}
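
One thing worth checking is whether the pattern actually covers every training example. The following is only an illustrative sketch using Python's standard `re` module, independent of Rasa; the `examples` list is copied from the training data above. Note that `CAIU38699623` has eight digits after the four letters, so it does not fully match a pattern that expects seven.

```python
import re

# Pattern copied from the regex:NROCONTAINER entry above.
CONTAINER_PATTERN = re.compile(r"[a-zA-Z]{3}[uU]{1}[0-9]{7}")

# Entity values copied from the intent:inform_container examples.
examples = ["NYKU3154360", "CAIU38699623", "YMLU5387323", "SEGU4074485"]

# fullmatch() succeeds only if the whole string fits the pattern.
results = {ex: bool(CONTAINER_PATTERN.fullmatch(ex)) for ex in examples}

for ex, ok in results.items():
    print(ex, "matches" if ok else "does NOT match")
```

If an example that is supposed to be valid fails this check, either the example or the pattern probably needs adjusting before training.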

domain.yml:

 intents:
   - inform_container: {use_entities: NROCONTAINER}
 entities:
   - NROCONTAINER
 slots:
   NROCONTAINER:
     type: text

Thanks in advance!

souvikg10 · October 30, 2018, 3:51pm

Try this PR. It was merged last month; with it you could do NER with a phrase matcher using regex.

github.com/RasaHQ/rasa: "Regex phrase matcher" (RasaHQ:master ← RasaHQ:regex_phrase_matcher), opened by twhughes on 13 Aug 2018, +289 -12

**Proposed changes**: Lookup tables may now be specified in the training data…. Individual lookup elements may be included directly as lists of strings. Alternatively the externally supplied lookup tables may be specified in the form of external files separated by newlines. For example

    {
      "rasa_nlu_data": {
        "lookup_tables": [
          {
            "name": "streets",
            "elements": ["main street", "washington ave", "elm street", "rocky road"]
          }
        ]
      }
    }

or in markdown format:

    ## lookup:streets
    - main street
    - washington ave
    - elm street
    - rocky road

External data files may be supplied as well. For example, ``data/lookup_tables/streets.txt`` may contain

    main street
    washington ave
    elm street
    rocky road

And can be loaded in along with additional elements as:

    {
      "rasa_nlu_data": {
        "lookup_tables": [
          {
            "name": "streets",
            "elements": "data/lookup_tables/streets.txt"
          }
        ]
      }
    }

or, equivalently, in markdown format as:

    ## lookup:streets
    data/lookup_tables/streets.txt

When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. The regex will only match phrases that are surrounded by word boundaries, such as spaces, newlines, commas, periods, etc. These regexes match over multiple tokens, so if ``main street`` was specified in the lookup table, this would match the tokens of ``meet me at 1223 main street`` as ``0 0 0 0 1 1``. The generated regexes are processed identically to the regular regex patterns directly specified in the training data.

**Status (please check what you already did)**:
- [x] made PR ready for code review
- [x] added some tests for the functionality
- [x] updated the documentation
- [x] updated the changelog
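
To make the behaviour described in the PR text concrete, here is a rough sketch of the idea: the lookup elements are combined into one case-insensitive pattern bounded by word boundaries, and each token is flagged according to whether it overlaps a match. This is not the actual Rasa implementation; the helper names `build_lookup_regex` and `token_flags` are made up for illustration, and whitespace tokenization is an assumption.

```python
import re

def build_lookup_regex(elements):
    """Combine lookup elements into one case-insensitive, word-bounded pattern."""
    escaped = [re.escape(element) for element in elements]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def token_flags(text, pattern):
    """Return 1 for each whitespace token that overlaps a regex match, else 0."""
    token_spans = [m.span() for m in re.finditer(r"\S+", text)]
    match_spans = [m.span() for m in pattern.finditer(text)]
    return [
        int(any(m_start < t_end and t_start < m_end
                for m_start, m_end in match_spans))
        for t_start, t_end in token_spans
    ]

streets = ["main street", "washington ave", "elm street", "rocky road"]
pattern = build_lookup_regex(streets)
flags = token_flags("meet me at 1223 main street", pattern)
print(flags)  # [0, 0, 0, 0, 1, 1]
```

The flags are only a feature handed to the entity extractor, so a match does not by itself guarantee that the entity is extracted.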

carla.lmeida · October 30, 2018, 4:12pm

Hey @souvikg10, thanks for the reply!

I’ve already supplied a lookup table in the training data, but I’m still not sure how to go about this. Could the solution you pointed to help here?

Container IDs do draw on a limited set of specific 4-letter combinations, which a lookup table could cover, but the problem is that the 7 digits that follow can vary.

So a user could type NYKU3154360, which I do have in my training data and lookup table, but could just as well type an ID with entirely different digits.

(thank you again for the help!)

souvikg10 · October 30, 2018, 4:25pm

Indeed, you are right: the PR only allows regex featurization of the lookup table.

This isn’t the right solution for you.

Did you try providing regex patterns for a particular entity in the training data?

souvikg10 · October 30, 2018, 5:39pm

Sorry, I was reading this on my phone and have only now looked at your issue properly. I see you have already provided the regex in your training data.

Normally that is all you need for regex pattern matching. However, adding the pattern to the training data only “improves” entity extraction; the extractor is still the NER CRF. Did you try adjusting the CRF features to improve the chances of detecting the entity?

I would recommend using a custom extractor for regex patterns, because the NER CRF isn’t fully reliable.

carla.lmeida · October 30, 2018, 5:39pm

I did try providing the pattern previously, and I followed the instructions on the training data format too.

Reading what you said about regex featurization of the lookup table actually gave me an idea, but since I’m new to this I’m not sure it’s practical. Could you give me your opinion on it?

The idea would be one intent just for the 4 letters, which would work with a lookup table, and another intent just for the numbers. So there would be two intents (inform letters and inform numbers) and two slots (letters and numbers). I’m not sure whether that would make things more complicated, especially because I have a custom action that uses the container ID slot to retrieve information from an API.

I’ve also read about custom dimensions in Duckling. Do you know if that could help too?

Thank you again for all the help, @souvikg10 ^^

souvikg10 · October 30, 2018, 5:42pm

Your first idea might confuse your classifier even further.

You can indeed add a custom dimension to Duckling, but bear in mind it is written in Haskell, so there is quite a learning curve.

The simplest option would be to add a regex extractor to your pipeline: save your patterns, each linked to an entity, in a JSON file, load the JSON on parse, and apply re.match().

Python has a very good regex library. You can even generalize this to many different entities, and you don’t need to train these entities using a CRF.
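
A minimal sketch of that approach, kept independent of Rasa so it runs on its own: the `PATTERNS_JSON` string stands in for a hypothetical JSON file, and `load_patterns` and `extract_entities` are illustrative names, not part of any Rasa API. It uses `finditer()` rather than `re.match()`, since `re.match()` only matches at the start of the string, and it wraps each pattern in word boundaries so that an ID with too many digits is not partially matched. The output dictionaries use `entity`, `value`, `start` and `end` keys on the assumption that this is the shape the rest of the pipeline expects.

```python
import json
import re

# Hypothetical contents of a patterns file: entity name -> regex.
PATTERNS_JSON = '{"NROCONTAINER": "[a-zA-Z]{3}[uU][0-9]{7}"}'

def load_patterns(raw_json):
    """Compile each entity pattern, wrapped in word boundaries."""
    return {
        entity: re.compile(r"\b(?:" + pattern + r")\b")
        for entity, pattern in json.loads(raw_json).items()
    }

def extract_entities(text, patterns):
    """Return one dict per regex match found anywhere in the text."""
    entities = []
    for entity, pattern in patterns.items():
        for match in pattern.finditer(text):
            entities.append({
                "entity": entity,
                "value": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
    return entities

patterns = load_patterns(PATTERNS_JSON)
found = extract_entities("meu número é ABCU1234567", patterns)
print(found)
```

Inside a real pipeline this logic would live in a custom component that appends the results to the message's entities; the details of that wiring depend on the Rasa version in use.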

carla.lmeida · October 30, 2018, 6:58pm

Indeed, learning Haskell from scratch to build a custom dimension seems like the more time-consuming route; a custom extractor for regex patterns sounds like a much better idea!

Like you said, it could also be super useful for the other entities we’ll need in the future. Thank you so much for that, @souvikg10! You really, really helped a lot! ^^

naoko · June 30, 2019, 5:05pm

@carla.lmeida, if you haven’t built a custom regex extractor component yet, I have written something that might help you: RASA Regex Entity Extraction - Naoko - Medium
