Improving Extraction of Alphanumeric Entity

Hi everyone,

I was wondering if there is anything I can do to improve extraction of an alphanumeric entity. We are currently working with shipping container IDs, and although the intent is recognized, the slot comes back empty whenever the entity does not match one of the examples in the NLU training data. Is there anything I can do to improve that?

Additional information:

  • There are already examples in our NLU training data

  • Our pipeline includes entity_featurizer_regex

  • Example of what’s in nlu.md:

    ## intent:inform_container
     - [NYKU3154360](NROCONTAINER)
     - [CAIU38699623](NROCONTAINER)
     - meu número é [YMLU5387323](NROCONTAINER)
     - meu número é [SEGU4074485](NROCONTAINER)
    ## regex:NROCONTAINER
     - [a-zA-Z]{3}[uU]{1}[0-9]{7}
    
  • domain.yml:

     intents:
       - inform_container: {use_entities: NROCONTAINER}
     entities:
       - NROCONTAINER
     slots:
       NROCONTAINER:
         type: text
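
A quick way to sanity-check the pattern against the annotated examples is to run it directly with Python's `re` module. This is only a minimal sketch outside the Rasa pipeline; the variable names are illustrative, and `MSKU1234567` is a made-up ID that does not appear in the training data.

```python
import re

# Pattern copied from the regex:NROCONTAINER entry above.
CONTAINER_PATTERN = re.compile(r"[a-zA-Z]{3}[uU]{1}[0-9]{7}")

# The four annotated examples, plus one made-up ID (an assumption for illustration).
candidates = [
    "NYKU3154360",
    "CAIU38699623",
    "YMLU5387323",
    "SEGU4074485",
    "MSKU1234567",  # hypothetical, not in the training data
]

# fullmatch requires the whole string to fit the pattern, not just a prefix.
results = {c: bool(CONTAINER_PATTERN.fullmatch(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

The unseen ID passes, so the pattern itself generalizes beyond the training examples. `CAIU38699623` has eight digits and fails a full match, so it may be worth checking whether that example is a typo.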
    

Thanks in advance! :relaxed:

Try this PR,

It was merged last month. You could do NER with a phrase matcher using regex.

Hey @souvikg10, thanks for the reply!

I’ve already supplied a lookup table in the training data, but I’m still having trouble figuring out how to go about this :sweat: Could the solution you provided help with this?

Container IDs do have a limited set of specific 4-letter combinations, which a lookup table could help with, but the problem is that the 7 digits that follow can vary.

So a user could type NYKU3154360, which I do have in my training data and lookup table, but could also type an ID with entirely different digits.

(thank you again for the help!)

Indeed, you are right: the PR allows regex featurization of the lookup table.

So it isn’t the right solution for you.

Did you try providing regex patterns for a particular entity in the training data?

Sorry, I have finally looked at your issue properly; I was reading it on my phone before. I see that you have already provided the regex in your training data.

Normally that is all you need for regex pattern matching. However, adding the pattern to the training data only “improves” entity extraction; the extractor is still the NER CRF, so the pattern is just one feature among others. Did you try tuning the CRF features to improve the chances of detecting the entity?

I would recommend using a custom extractor for regex patterns, because the NER CRF isn’t fully reliable.

I did previously try providing the pattern :sweat: and I followed the instructions about the training data format too.

Reading what you said about regex featurization of the lookup table actually gave me an idea, but since I’m new to this I’m not sure it’s practical. Could you perhaps give me your opinion on it?

The idea would be one intent just for the 4 letters, which would work with a lookup table, and another intent just for the numbers. So there would be two intents (inform letters and inform numbers) and two slots (letters and numbers). I’m not sure whether that would make things more complicated, especially because I’m using a custom action that takes the container ID slot to retrieve information from an API.

I’ve also read about custom dimensions in Duckling, do you know if that could perhaps help too?

Thank you again for all the help, @souvikg10 ^^

Your first idea, splitting the ID across two intents, might confuse your classifier even further.

You can indeed add a custom dimension to Duckling, but bear in mind that it is written in Haskell, so there is quite a learning curve.

The simplest option would be to add a regex extractor to your pipeline: save your patterns, each linked to an entity, in a JSON file, then load the JSON at parse time and run re.match().

Python has a very good regex library. You can even generalize this to many different entities, and you don’t need to train those entities with a CRF.
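
The approach described above could be sketched roughly as follows. This is plain Python rather than an actual Rasa component, so wiring it into the pipeline is left out; the function names, the inline JSON, the added `\b` word boundaries, and the ID `MSKU1234567` are all assumptions made for illustration. It uses `finditer` instead of `re.match()` so that an ID in the middle of a sentence is still found.

```python
import json
import re

# Hypothetical contents of a patterns file mapping entity names to regexes.
# In practice this could be read from disk with json.load().
# The \b word boundaries are an addition (assumption) so that an ID with
# too many digits is not partially matched.
PATTERNS_JSON = r'{"NROCONTAINER": "\\b[a-zA-Z]{3}[uU][0-9]{7}\\b"}'


def load_patterns(raw_json):
    """Compile each entity's pattern once so parsing stays cheap."""
    return {name: re.compile(p) for name, p in json.loads(raw_json).items()}


def extract_entities(text, patterns):
    """Return every pattern match in `text` with its span and entity name."""
    entities = []
    for name, pattern in patterns.items():
        for m in pattern.finditer(text):
            entities.append(
                {
                    "start": m.start(),
                    "end": m.end(),
                    "value": m.group(0),
                    "entity": name,
                }
            )
    return entities


patterns = load_patterns(PATTERNS_JSON)
text = "meu número é MSKU1234567"  # made-up ID, not from the training data
found = extract_entities(text, patterns)
print(found)
```

Because the patterns live in data rather than code, supporting another entity would only mean adding another key to the JSON.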


Indeed, learning Haskell from scratch to build a custom dimension seems far more time-consuming, and a custom extractor for regex patterns sounds like a much better idea! :relaxed:

Like you said, that could also be super useful for the other entities we’ll have to handle in the future. Thank you so much for that, @souvikg10! You really, really helped a lot! ^^


@carla.lmeida, if you haven’t built a custom regex extractor component yet, I have written something that might help you: RASA Regex Entity Extraction - Naoko - Medium