jamesmf
(jamesmf)
August 8, 2019, 2:58pm
1
A thing that seems like it would add a lot of out-of-the-box power to custom entity recognizers is the ability to pass token-level features to `CRFEntityExtractor` (previously `ner_crf`).
The simplest version of this would be a `SpacyEntityFeaturizer` that would make a token’s `.vector` attribute available to `CRFEntityExtractor`. That would let you use much more powerful features in classifying your custom entities than simply part-of-speech or the other current features.
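To make the idea concrete, here is a rough sketch of what such a featurizer might do. It is not the code in the PR: `FakeToken`, `token_vectors`, `attach_ner_features`, and the plain-dict message are all hypothetical stand-ins so the snippet runs without spaCy installed. With spaCy, `doc` would be the result of `nlp(text)`, and each token already exposes a `.vector` attribute.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class FakeToken:
    """Stand-in for a spaCy Token; only the attributes used below."""
    text: str
    vector: Sequence[float]


def token_vectors(doc) -> List[List[float]]:
    """Return one dense vector per token, in token order."""
    return [[float(x) for x in token.vector] for token in doc]


def attach_ner_features(message: Dict, doc) -> Dict:
    """Store per-token vectors under a hypothetical 'ner_features' key."""
    message["ner_features"] = token_vectors(doc)
    return message


# Two tokens with 2-dimensional vectors (real spaCy vectors are much wider).
doc = [FakeToken("book", [0.1, 0.2]), FakeToken("Berlin", [0.7, -0.3])]
message = attach_ner_features({"text": "book Berlin"}, doc)
print(message["ner_features"])
```

The point is only that the featurizer writes one vector per token onto the message, so a later component can pick it up.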
I am working on a way to add this feature and would love comments/feedback/confirmation that others want this feature.
RasaHQ:master
← jamesmf:custom_ner_features
opened 04:35AM - 07 Aug 19 UTC
This is a work-in-progress implementation that allows custom entity-featurizers to pass features along to `CRFEntityExtractor`. This would enable new featurizers that pass word/token vectors along to `CRFEntityExtractor`. Following this `ner_features` pattern would also make it easy to develop other components that do custom NER.
The edit to the example is just to show that, while the `DummyNERFeaturizer` provides nothing but noise to the `CRFEntityExtractor`, it seems to be working.
Just looking for feedback before I add tests or do much else.
Addresses https://forum.rasa.com/t/feeding-custom-pretrained-embeddings-for-ner-crf/5406
**Proposed changes**:
- Make `CRFEntityExtractor` look for `ner_features` on the message
- Have `CRFEntityExtractor` convert the features from an array to `python-crfsuite` style dicts
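The array-to-dict conversion in the second bullet could look roughly like the sketch below. The function name `vectors_to_crf_dicts` and the `ner_<i>` key naming are assumptions for illustration, not necessarily what the PR uses. It relies only on the fact that `python-crfsuite` accepts per-token feature dicts that map string keys to float values.

```python
from typing import Dict, List, Sequence


def vectors_to_crf_dicts(
    ner_features: Sequence[Sequence[float]], prefix: str = "ner"
) -> List[Dict[str, float]]:
    """Turn a (tokens x dims) array into one feature dict per token.

    Each dimension becomes its own named feature whose value is the
    float weight for that token.
    """
    return [
        {f"{prefix}_{i}": float(value) for i, value in enumerate(row)}
        for row in ner_features
    ]


# Two tokens, three feature dimensions each.
features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
crf_dicts = vectors_to_crf_dicts(features)
print(crf_dicts[0])
```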
**Status (please check what you already did)**:
- [X] made PR ready for code review
- [x] added some tests for the functionality
- [x] updated the documentation
- [x] updated the changelog
- [X] reformat files using `black` (please check [Readme](https://github.com/RasaHQ/rasa_nlu#code-style) for instructions)
Ghostvv
(Vladimir Vlasov)
August 9, 2019, 2:59pm
2
Thank you for proposing this idea. Did you run any comparison experiments to analyze the performance?
jamesmf
(jamesmf)
August 9, 2019, 5:12pm
3
Still brainstorming that. Are there any standard datasets where this would be helpful?
The restaurantbot or something similar?
jamesmf
(jamesmf)
August 11, 2019, 10:00pm
4
Here we go, updated the PR with the results from a `rasa test nlu` run:
opened 05:17AM - 09 Aug 19 UTC
closed 03:59AM - 22 Jan 20 UTC
type:enhancement
**Description of Problem**:
Right now custom entities can only use `pos` features from `spacy` and a handful of simple features. This seems to be in contrast to the flexibility and power of the other pipeline components which can take advantage of any combination of built-in and custom `featurizers`. Ideally, there would be a way to pass `ner_features` to the `CRFEntityExtractor`. In particular, this would let you train NER that used word/token vectors straight from spacy (or other pretrained models).
**Overview of the Solution**:
- `CRFEntityExtractor` needs to additionally check for `ner_features` on the message and add them to the feature dict it passes to `sklearn_crfsuite`.
- There need to be NER featurizer classes added
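As a rough illustration of the first bullet, the extractor could check the message for `ner_features` and, when present, fold them into the per-token feature dicts it already builds. Everything here is a sketch under assumptions: `merge_ner_features` is a hypothetical helper, the message is a plain dict rather than a Rasa message object, and the base features (`low`, `title`) only imitate the kind of hand-written features the extractor uses.

```python
from typing import Any, Dict, List, Optional, Sequence


def merge_ner_features(
    base_features: List[Dict[str, Any]],
    ner_features: Optional[Sequence[Sequence[float]]],
) -> List[Dict[str, Any]]:
    """Add dense per-token features to existing per-token feature dicts."""
    if ner_features is None:
        # No featurizer ran: keep the existing behaviour unchanged.
        return [dict(f) for f in base_features]
    if len(ner_features) != len(base_features):
        raise ValueError("ner_features must have one row per token")
    merged = []
    for base, row in zip(base_features, ner_features):
        combined = dict(base)  # copy so the input is not mutated
        combined.update({f"ner_{i}": float(v) for i, v in enumerate(row)})
        merged.append(combined)
    return merged


message = {"text": "book Berlin", "ner_features": [[0.1, 0.2], [0.7, -0.3]]}
base = [{"low": "book", "title": False}, {"low": "berlin", "title": True}]
merged = merge_ner_features(base, message.get("ner_features"))
print(merged[1])
```

Falling back to the base features when the key is missing keeps pipelines without an NER featurizer working as before.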
**Examples** (if relevant):
The skeleton of this (both adding a `spacy`-based featurizer and making `CRFEntityExtractor` use `ner_features`) is implemented in this PR
https://github.com/RasaHQ/rasa/pull/4187
Please let me know if this looks like a useful feature and if this PR is heading in the right direction.
Still necessary:
- Add tests
- Extend `Featurizer` to also have `_combine_with_existing_ner_features`
- Validate that having default spacy tokens noticeably improves NER for a sample task
- Make `spacy` only optionally add to `ner_features`
- Replace the hard-coded lambda functions in `CRFEntityExtractor` with a simple `Featurizer`
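One plausible reading of the `_combine_with_existing_ner_features` item is sketched below: when several featurizers run in sequence, each appends its per-token columns to whatever is already there. This is an assumption about the intended behaviour, written as a standalone function on plain lists rather than as a method on `Featurizer`.

```python
from typing import List, Optional, Sequence


def combine_with_existing_ner_features(
    existing: Optional[Sequence[Sequence[float]]],
    additional: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate new per-token features onto existing ones, row by row."""
    if existing is None:
        # First featurizer in the pipeline: nothing to combine with.
        return [list(row) for row in additional]
    if len(existing) != len(additional):
        raise ValueError("both feature sets must have one row per token")
    return [list(old) + list(new) for old, new in zip(existing, additional)]


first = combine_with_existing_ner_features(None, [[1.0], [2.0]])
both = combine_with_existing_ner_features(first, [[3.0, 4.0], [5.0, 6.0]])
print(both)
```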
**Definition of Done**:
<!-- What needs to be there to consider this feature as done?
- [ ] Tests are added
- [ ] Feature described in the docs
- [ ] Feature mentioned in the changelog
Ghostvv
(Vladimir Vlasov)
August 12, 2019, 1:11pm
5
Thanks a lot! Let’s move the discussion to this PR. Our team will review it as soon as possible.