User shorthand entity extraction

I need to extract a username that follows a specific format, so I assumed this would be fairly easy. I have a few examples in the NLU file. Since the model did not seem to generalize the pattern, I added a regex in nlu.md and the RegexFeaturizer to the config. When that didn’t do the trick, I added more examples in a lookup table, but that didn’t help either.

Is this approach reasonable, or should one use Duckling for such things? If so, how does one create a custom entity with Duckling?

Can you provide a sample conversation? Are you using a form to input the user name?

Yes, I am using a form, but that shouldn’t matter, since forms only change the way slots are filled. Here the entity isn’t extracted in the first place. A sample conversation isn’t relevant either, since this is purely an NLU issue.

The inform intent is classified correctly, but no entity is extracted. Here’s some sample NLU data and the NLU config:

## intent:inform
...
- [abc2de](nt_user)
- [klr6si](nt_user)
- [ort3fe](nt_user)
- [poe9fe](nt_user)

## regex:NT-User
- [a-zA-Z]{3}[0-9][a-zA-Z]{2}

## lookup:NT-User
- hbf3fe
- kvn4si
- klr6si
- ort3fe
- poe9fe
- ewf7fe
- ief1fe
- rok9fe
- prr3si
- qde1si

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100

I think Duckling would not work here. It is a system that uses (among other things) heuristics that are known to work for somewhat common labels. It is therefore limited to a set of predefined dimensions. You can find them defined here.

Ok, so my current workaround is to fill the slot from text (in the slot_mappings method of the FormAction) and try to match the regex in the validate_user_name method.

Anyway, I don’t consider this a permanent solution, and I really wonder why this doesn’t work in the NLU part. Could this simply be a matter of too little training data? (I currently have about 25 examples.)
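The regex check in that workaround can be sketched in plain Python. This is only an illustration under assumptions: the helper name match_nt_user is hypothetical, the pattern is the one from the regex section of the NLU data with word boundaries added, and the rasa_sdk wiring (what validate_user_name returns) is left out.

```python
import re
from typing import Optional

# Pattern from the NLU data, wrapped in word boundaries (an added assumption)
# so that it only matches a whole token and not part of a longer one.
NT_USER_PATTERN = re.compile(r"\b[a-zA-Z]{3}[0-9][a-zA-Z]{2}\b")


def match_nt_user(text: str) -> Optional[str]:
    """Return the first username-shaped token in the text, or None."""
    match = NT_USER_PATTERN.search(text)
    return match.group() if match else None


print(match_nt_user("my user is klr6si"))
print(match_nt_user("no username in here"))
```

A validate_user_name method could call such a helper on the raw message text and reject the slot value when it returns None.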

I’ll quote the documentation on the RegexFeaturizer:

For each regex, a feature will be set marking whether this expression was found in the user message or not. All features will later be fed into an intent classifier / entity extractor to simplify classification (assuming the classifier has learned during the training phase, that this set feature indicates a certain intent / entity). Regex features for entity extraction are currently only supported by the CRFEntityExtractor and the DIETClassifier components!

The featurizer generates a binary sparse feature that goes into the model. It may very well be, though, that this one tiny feature is drowning among all the other generated features and that the model has a hard time picking it up as the proper signal. This is my gut feeling at least. Something similar happens with the lookup table.
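To get a rough feel for the proportions, here is a small sketch that counts the character n-grams a single token produces with min_ngram 1 and max_ngram 4. It is a simplified imitation of a char_wb style analyzer (the token is padded with one space on each side), not the scikit-learn or Rasa implementation, and the function name char_wb_ngrams is made up for this example.

```python
def char_wb_ngrams(token, min_n, max_n):
    """List all character n-grams of the space-padded token, n in [min_n, max_n]."""
    padded = f" {token} "
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the padded token.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


distinct = set(char_wb_ngrams("abc2de", 1, 4))
print(len(distinct), "distinct character n-gram features vs. 1 regex feature")
```

For the example username abc2de this yields 25 distinct n-grams, most of which are specific to that one username, against a single regex flag for the same token.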

As a potential fix, I guess you could try turning down the max_ngram to 2-3 just to see if it helps.

It also deserves mentioning that we’re working on a feature that allows you to define a lookup table that won’t just cause features but will actually cause entities to appear. It will be called the LookupEntityExtractor.
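The difference between "causing features" and "causing entities" can be illustrated with a few lines of Python. This is not the planned component, only a sketch of the idea: the function name extract_lookup_entities is hypothetical, tokenization is a plain \w+ regex, and the output imitates the start/end/value/entity dictionaries that Rasa uses for entities.

```python
import re


def extract_lookup_entities(text, lookup, entity):
    """Return an entity dict for every token of the text that is in the lookup list."""
    entries = {entry.lower() for entry in lookup}
    entities = []
    for match in re.finditer(r"\w+", text):
        if match.group().lower() in entries:
            entities.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "entity": entity,
                }
            )
    return entities


print(extract_lookup_entities("my user is klr6si", ["klr6si", "ort3fe"], "nt_user"))
```

Here a hit in the table directly produces an entity, so nothing depends on the model learning to use a feature.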

Thanks for the answer.

Unfortunately decreasing max_ngram didn’t make a big difference. Neither did increasing the number of examples in the lookup.

In LookupEntityExtractor, tabergma mentioned that a RegexEntityExtractor is planned. I don’t believe it will be released soon enough, though, since my project is limited in time. Someone posted a custom component for that here, but I can’t get it to run; it fails on the import:

>>> from rasa.nlu.extractors import EntityExtractor
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'EntityExtractor'
>>> import rasa.nlu.extractors
>>> dir(rasa.nlu.extractors)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

Python finds the package, but the name EntityExtractor isn’t bound in it. I checked, and the files are in the respective site-packages directory.

(rasa) ***@***:~$ ls ~/venvs/rasa/lib/python3.6/site-packages/rasa/nlu/extractors/
__init__.py              duckling_http_extractor.py  mitie_entity_extractor.py
__pycache__              entity_synonyms.py          spacy_entity_extractor.py
crf_entity_extractor.py  extractor.py
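Since dir() shows no public names on the package while the listing contains an extractor.py, the class may be defined in a submodule and simply not re-exported by __init__.py. A generic way to check which submodules expose a name is sketched below. The helper find_submodules_exposing is hypothetical, and it is demonstrated on the standard library's json package so that it runs without Rasa installed. Calling it with "rasa.nlu.extractors" and "EntityExtractor" is the assumed use here, and rasa.nlu.extractors.extractor as the likely result is only inferred from the file listing.

```python
import importlib
import pkgutil


def find_submodules_exposing(package_name, attr_name):
    """Return the names of direct submodules of a package that expose attr_name."""
    package = importlib.import_module(package_name)
    hits = []
    for info in pkgutil.iter_modules(package.__path__, prefix=package.__name__ + "."):
        try:
            module = importlib.import_module(info.name)
        except Exception:
            # Skip submodules that cannot be imported, e.g. missing optional dependencies.
            continue
        if hasattr(module, attr_name):
            hits.append(info.name)
    return hits


print(find_submodules_exposing("json", "JSONDecoder"))
```

If a submodule shows up in the result, importing the name from that submodule instead of from the package should get past the ImportError.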

I’m on the latest release, 1.10.8.

Do you have an idea how to make it work?