User shorthand entity extraction

Vale_Boca · July 17, 2020, 11:59am

I need to extract a username that follows a specific format, so I assumed this would be fairly easy. I have a few examples in the NLU file. Since the model did not seem to generalize the pattern, I added a regex to nlu.md and the RegexFeaturizer to the config. When that didn’t do the trick, I added more examples in a lookup table, but that didn’t help either.

Is this approach reasonable, or should one use Duckling for such things? If so, how do I create a custom entity with Duckling?

samscudder · July 17, 2020, 12:45pm

Can you provide a sample conversation? Are you using a form to input the user name?

Vale_Boca · July 20, 2020, 8:00am

Yes, I am using a form, but that shouldn’t matter, since forms only change the way slots are filled; here the entity isn’t extracted in the first place. A sample conversation isn’t relevant either, since this is purely an NLU issue.

The inform intent is classified correctly, but no entity is extracted. Here’s some sample NLU data and the NLU config:

## intent:inform
...
- [abc2de](nt_user)
- [klr6si](nt_user)
- [ort3fe](nt_user)
- [poe9fe](nt_user)

## regex:NT-User
- [a-zA-Z]{3}[0-9][a-zA-Z]{2}

## lookup:NT-User
- hbf3fe
- kvn4si
- klr6si
- ort3fe
- poe9fe
- ewf7fe
- ief1fe
- rok9fe
- prr3si
- qde1si

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100

koaning · July 20, 2020, 9:15am

I think Duckling would not work here. It is a system that uses (among other things) heuristics that are known to work for somewhat common labels. It is therefore limited to a set of predefined dimensions. You can find them defined here.

Vale_Boca · July 20, 2020, 10:41am

Ok, so a current workaround for me is to fill the slot from text (in the slot_mappings method of the FormAction) and try to match the regex in the validate_user_name method.
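
The validation half of this workaround can be sketched in plain Python. This is only a minimal sketch, not the actual form code: `validate_nt_user` is a hypothetical helper that a validate_user_name method could call, and requiring the whole message to match the pattern is an assumption.

```python
import re
from typing import Optional

# Same pattern as the regex in the NLU data:
# three letters, one digit, two letters.
NT_USER_PATTERN = re.compile(r"[a-zA-Z]{3}[0-9][a-zA-Z]{2}")


def validate_nt_user(text: str) -> Optional[str]:
    """Return the username if the whole (stripped) text matches, else None.

    Hypothetical helper: a form's validation method could call this and
    reject the slot value when None comes back.
    """
    candidate = text.strip()
    if NT_USER_PATTERN.fullmatch(candidate):
        return candidate
    return None
```

Using `fullmatch` rather than `search` means a longer message that merely contains a username is rejected; switching to `search` would be the choice if usernames should be picked out of a full sentence.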

Anyway, I don’t consider this a permanent solution, and I really wonder why this doesn’t work in the NLU part. Could this simply be a matter of too little training data? (I currently have about 25 examples.)

koaning · July 21, 2020, 9:17am

I’ll quote the documentation on the RegexFeaturizer:

For each regex, a feature will be set marking whether this expression was found in the user message or not. All features will later be fed into an intent classifier / entity extractor to simplify classification (assuming the classifier has learned during the training phase, that this set feature indicates a certain intent / entity). Regex features for entity extraction are currently only supported by the CRFEntityExtractor and the DIETClassifier components!

The featurizer generates a binary sparse feature that goes into the model. It may very well be though that this one tiny feature is drowning among all the other features that are generated and that the model is having a hard time picking it up as the proper signal. This is my gut feeling at least. There’s a similar thing happening with the lookup table.
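
To get a feel for the proportions, here is a rough sketch in plain Python. It does not reproduce Rasa's actual featurization: `char_ngrams` only approximates a "char_wb" analyzer by padding each token with spaces, and both function names are hypothetical. It simply counts how many distinct character n-gram dimensions a few example tokens produce next to the single regex feature.

```python
import re
from typing import Set

NT_USER_PATTERN = re.compile(r"[a-zA-Z]{3}[0-9][a-zA-Z]{2}")


def char_ngrams(token: str, min_n: int = 1, max_n: int = 4) -> Set[str]:
    """Distinct character n-grams of a space-padded token.

    Rough stand-in for a "char_wb" analyzer (assumption, not Rasa code).
    """
    padded = f" {token} "
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


def regex_feature(token: str) -> int:
    """The single binary feature: 1 if the token matches the pattern, else 0."""
    return int(NT_USER_PATTERN.fullmatch(token) is not None)


tokens = ["abc2de", "klr6si", "ort3fe", "poe9fe"]

# Union of all n-grams seen across the tokens = number of sparse dimensions.
vocabulary: Set[str] = set()
for token in tokens:
    vocabulary |= char_ngrams(token)

print("regex feature per token:", [regex_feature(t) for t in tokens])
print("distinct char n-gram dimensions:", len(vocabulary))
```

Even for four short tokens the n-gram vocabulary is far larger than one dimension, which is the imbalance described above.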

As a potential fix, I guess you could try turning down the max_ngram to 2-3 just to see if it helps.

koaning · July 21, 2020, 9:19am

It also deserves mentioning that we’re working on a feature that allows you to define a lookup table that won’t just cause features but will actually cause entities to appear. It will be called the LookupEntityExtractor.

Vale_Boca · July 21, 2020, 12:37pm

Thanks for the answer.

Unfortunately decreasing max_ngram didn’t make a big difference. Neither did increasing the number of examples in the lookup.

In LookupEntityExtractor tabergma mentioned that a RegexEntityExtractor is planned. However, I don’t believe it will be released soon enough, since my project is limited in time. Someone posted a custom component for that here, but I can’t get it to run; it fails on the import:

>>> from rasa.nlu.extractors import EntityExtractor
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'EntityExtractor'
>>> import rasa.nlu.extractors
>>> dir(rasa.nlu.extractors)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

Python finds the package, but the name EntityExtractor is not bound in it. I checked, and the files are present in the respective site-packages directory:

(rasa) ***@***:~$ ls ~/venvs/rasa/lib/python3.6/site-packages/rasa/nlu/extractors/
__init__.py              duckling_http_extractor.py  mitie_entity_extractor.py
__pycache__              entity_synonyms.py          spacy_entity_extractor.py
crf_entity_extractor.py  extractor.py
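
One way to narrow this down is to check which submodule of the package actually exposes the name, since `dir()` on a package only shows what its `__init__.py` binds. The following is a minimal sketch using only the standard library; `find_modules_defining` is a hypothetical helper, not part of Rasa.

```python
import importlib
import pkgutil
from typing import List


def find_modules_defining(package_name: str, attr_name: str) -> List[str]:
    """List direct submodules of a package that expose the given attribute.

    Note: hasattr is also true for modules that merely import the name,
    so the result may include more than the defining module.
    """
    package = importlib.import_module(package_name)
    found = []
    for info in pkgutil.iter_modules(package.__path__, prefix=package_name + "."):
        try:
            module = importlib.import_module(info.name)
        except Exception:
            # A submodule may need optional dependencies; skip it.
            continue
        if hasattr(module, attr_name):
            found.append(info.name)
    return found
```

Calling `find_modules_defining("rasa.nlu.extractors", "EntityExtractor")` in the same virtualenv would show which module to import from. Judging by the listing, extractor.py looks like the likely candidate, but that is an assumption until checked.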

I’m on the latest release 1.10.8.

Do you have an idea how to make it work?
