Advice for a NER component to recognize a *very* large set of entities with their own grammar

I was initially researching LSTM-based NER components, but after reading "How can I use a custom model for Named Entity Recognition?" (and "Entity extraction with the new lookup table feature in Rasa NLU") I realized that I needed some more advice.

In short, while the recently released lookup table feature is attractive, it’s not a satisfactory solution for the use case I’m working on: I have a list of “things” that has roughly 20-40k values, and these need to be “fuzzily” matched. These “things” are neither locations nor dates, but they do have their own syntax/grammar. Because users will not know the 20-40k list off the top of their head, this NER component needs to be fuzzy.

My questions are:

  • what type of settings on the NER CRF component would help the most?
  • LSTM, given that it’s fuzzier, seemed like it might be better. Any thoughts about that? Any LSTM implementations that anyone would recommend?
  • Am I missing another NER algorithm/type that anyone would suggest?

Tyler, your input would be very much appreciated, given that you seem to have looked at variants of this problem before!



First, a few points:

With 20-40k elements, I imagine it would be difficult to build a training set that covers all of them with good accuracy. Without knowing more about the problem, it’s hard to say what the best approach is, but I think it’s worth thinking about which features are most indicative of these elements. For example, do they often end in similar suffixes? Do they often share character ngrams? If you can come up with some features like this, you may be able to use the NER CRF without the lookup tables.
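
One quick way to check whether the elements share character ngrams is to count how many elements each ngram appears in. The following is only a rough sketch: the entity values and the function names are invented for illustration, and the threshold is an arbitrary assumption.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all character n-grams of length n in text (lowercased)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def shared_ngrams(entities, n=3, min_fraction=0.5):
    """Return n-grams that occur in at least min_fraction of the entities."""
    counts = Counter()
    for entity in entities:
        # Use a set so each n-gram is counted once per entity.
        counts.update(set(char_ngrams(entity, n)))
    threshold = min_fraction * len(entities)
    return {gram: c for gram, c in counts.items() if c >= threshold}

# Hypothetical entity values, invented for illustration only.
entities = ["amoxicillin", "ampicillin", "penicillin", "ibuprofen"]
common = shared_ngrams(entities, n=4, min_fraction=0.75)
print(sorted(common))
```

If a handful of ngrams cover a large share of the list, that is a hint that subword or suffix features could carry a lot of the signal.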

The lookup tables work by searching for exact matches (with word boundaries) using a regex pattern matcher. However, if you remove the word boundary tokens from the regex pattern, it will also match lookup table elements inside words. As we mention in the blog post, with a lookup table containing character ngrams, you can use subword features in the NER CRF. This may be worth a try, as I believe we had evidence that it was somewhat promising for fuzzy entity matching. However, you’d first need to generate a list of suitable character ngrams from your set of entities. In the blog post, we describe one way to do this using randomized logistic regression. The code is not terribly complex and we might be able to share a basic notebook demo with you.
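
To make the word boundary point concrete, here is a small sketch of the idea using Python's `re` module. It is not the actual Rasa NLU code; `build_lookup_pattern` and the example ngrams are hypothetical names and values chosen for illustration.

```python
import re

def build_lookup_pattern(elements, word_boundaries=True):
    """Compile one alternation regex from a list of lookup elements."""
    # Sort longest first so longer elements win over their own prefixes.
    escaped = [re.escape(e) for e in sorted(elements, key=len, reverse=True)]
    body = "(?:" + "|".join(escaped) + ")"
    if word_boundaries:
        # Only match elements that stand alone as whole words.
        body = r"\b" + body + r"\b"
    return re.compile(body, re.IGNORECASE)

# Hypothetical character ngrams, invented for illustration only.
ngrams = ["cill", "profen"]
text = "I was given amoxicillin and something ending in profen"

exact = build_lookup_pattern(ngrams, word_boundaries=True)
subword = build_lookup_pattern(ngrams, word_boundaries=False)

exact_hits = exact.findall(text)      # whole-word matches only
subword_hits = subword.findall(text)  # also matches inside longer words
print(exact_hits, subword_hits)
```

With boundaries, only the standalone word is found; without them, the ngram inside the longer word is found as well, which is what lets the match act as a subword feature.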

We tried using fuzzy matching lookup tables, but these were incredibly slow to train and match. With 20k elements I’m afraid you’d be out of luck there, unless you can find a super efficient way to implement it or can reduce your list to a few hundred elements.
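
For a sense of where the cost comes from, here is a naive fuzzy lookup using the standard library's `difflib`. This is an assumption-laden sketch and not what we actually ran: `fuzzy_lookup`, the cutoff and the element values are all made up for illustration.

```python
import difflib

def fuzzy_lookup(token, elements, cutoff=0.8):
    """Return the element closest to token, or None if nothing clears cutoff."""
    # get_close_matches scores token against every element in the list,
    # so the work per token grows with the size of the lookup table.
    matches = difflib.get_close_matches(token, elements, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical lookup elements, invented for illustration only.
elements = ["amoxicillin", "ampicillin", "ibuprofen"]

hit = fuzzy_lookup("amoxicilin", elements)  # misspelled input
miss = fuzzy_lookup("xyz", elements)        # nothing similar enough
print(hit, miss)
```

Since every token in every message is compared against the whole list, a 20-40k element table multiplies the work accordingly, which is why shrinking the list or using an indexed approach matters.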

To answer your more specific questions:

  • I would imagine that prefix and suffix features of various lengths would be the best bet for inclusion in the NER CRF. These should be able to match some common features in your entities as long as you supply sufficient training examples.

  • We tried some variants of CNN and LSTM to do entity extraction on the character level, but found that it didn’t work… However, it’s possible that this could work with much more tweaking. I wrote custom code in Keras, and I’d recommend going that route as well, unless a quick search for ‘LSTM entity extraction’ gives you some promising results.

  • We also tried NCRF++, which is basically a hybrid of word level CRF and character level RNN models. You can find it on GitHub at jiesutd/NCRFpp (NCRF++, an open-source neural sequence labeling toolkit that includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components; code for the COLING/ACL 2018 paper). While this worked alright, it had some disadvantages compared to crf_suite in that it was slower to train and more data-hungry. Still, it's worth a shot.
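
As a rough illustration of the prefix and suffix features from the first bullet, here is how per-token feature dicts could be built. CRF toolkits such as sklearn-crfsuite accept features in this list-of-dicts shape, but the feature names and the `affix_features` helper below are hypothetical and are not the names the NER CRF component uses.

```python
def affix_features(token, max_len=4):
    """Build a feature dict of prefixes and suffixes up to max_len characters."""
    lowered = token.lower()
    features = {"lower": lowered}
    for n in range(1, max_len + 1):
        # Skip affix lengths longer than the token itself.
        if len(lowered) >= n:
            features[f"prefix{n}"] = lowered[:n]
            features[f"suffix{n}"] = lowered[-n:]
    return features

def sentence_features(tokens, max_len=4):
    """One feature dict per token, in sentence order."""
    return [affix_features(t, max_len) for t in tokens]

# Hypothetical tokens, invented for illustration only.
feats = sentence_features(["take", "Amoxicillin"], max_len=3)
print(feats[1])
```

Using several affix lengths at once lets the CRF pick whichever length generalizes best across your entities, at the cost of a larger feature space.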

As I don't work at Rasa anymore, if you want more info, feel free to contact me directly or open an issue on rasa_nlu. @amn41, thoughts on this? My GitHub is twhughes (Tyler Hughes), where you can find more contact info.

Hope this helps a bit.
