First, a few points:
With 20-40k elements, I imagine it might be difficult to construct a training set that covers all of them with good accuracy. So without knowing much about the problem, it's hard to say what the best approach is, but I think it's worth considering which features are most indicative of these elements. For example, do they often end in similar suffixes? Do they often share character ngrams? If you can come up with features like this, you may be able to use the NER CRF without the lookup tables.
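To get a feel for whether such features exist, you can just count suffixes and character ngrams over your entity list before touching the CRF at all. A minimal sketch (the entity list here is made up for illustration):

```python
from collections import Counter

def common_suffixes(entities, length=3, top=5):
    """Most frequent suffixes of a given length across the entity list."""
    counts = Counter(e[-length:] for e in entities if len(e) >= length)
    return counts.most_common(top)

def common_char_ngrams(entities, n=3, top=5):
    """Most frequent character n-grams across the entity list."""
    counts = Counter()
    for e in entities:
        counts.update(e[i:i + n] for i in range(len(e) - n + 1))
    return counts.most_common(top)

# Toy example: drug-like names sharing an "-in" suffix
entities = ["metformin", "aspirin", "warfarin", "heparin"]
common_suffixes(entities, length=2, top=1)   # "in" dominates
```

If the top suffixes or ngrams cover a large fraction of your 20-40k elements, that's a good sign the CRF can generalize from a modest training set.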
The lookup tables work by searching for exact matches (with word boundaries) using a regex pattern matcher. However, if you remove the word boundary tokens from the regex pattern, it will also match lookup table entries within words. As we mention in the blog post, with a lookup table containing character ngrams, you can get subword features into the NER CRF. This may be worth a try, as I believe we had evidence that this was somewhat promising for fuzzy entity matching. However, you'd need to first generate a list of suitable character ngrams from your set of entities. In the blog post, we describe one way to do this using randomized logistic regression. The code is not terribly complex and we might be able to share a basic notebook demo with you.
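The word-boundary point is easy to demonstrate with plain regex. A quick sketch (the ngram and sentence are made up; the point is only the `\b` behavior):

```python
import re

def exact_matches(pattern, text):
    """With \\b word boundaries, only whole-word occurrences match."""
    return re.findall(r"\b" + re.escape(pattern) + r"\b", text)

def subword_matches(pattern, text):
    """Without the boundaries, occurrences inside words match too."""
    return re.findall(re.escape(pattern), text)

text = "atorvastatin inhibits HMG-CoA reductase"
exact_matches("stat", text)    # no match: "stat" never stands alone
subword_matches("stat", text)  # matches inside "atorvastatin"
```

So dropping the `\b` tokens is what turns a lookup table of character ngrams into a subword feature source.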
We tried using fuzzy matching lookup tables but these were incredibly slow to train and match. With 20k elements I’m afraid you’d be out of luck there unless you can find a super efficient way to implement it or can reduce your list to 100s of elements.
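To see why it scales badly: a naive fuzzy lookup has to compare each token against every entity, so cost grows linearly with the list. A stdlib sketch of that naive approach (not the implementation we used):

```python
import difflib

def fuzzy_lookup(token, entities, cutoff=0.85):
    """Naive fuzzy match: compares the token against every entity,
    so per-token cost is O(len(entities)) -- fine for hundreds of
    entries, painful for 20k+."""
    return difflib.get_close_matches(token, entities, n=1, cutoff=cutoff)

entities = ["ibuprofen", "paracetamol", "amoxicillin"]
fuzzy_lookup("ibuprofn", entities)  # a close misspelling still matches
```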
To answer your more specific questions:
I would imagine that prefix and suffix features of various lengths would be the best bet for inclusion in the NER CRF. These should be able to match some common features in your entities as long as you supply sufficient training examples.
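Concretely, prefix/suffix features look like this in the dict style sklearn-crfsuite expects. This is just a sketch; the feature names your pipeline actually uses may differ:

```python
def token_features(token, max_len=4):
    """Prefix and suffix features of several lengths for one token,
    as a feature dict in the sklearn-crfsuite style."""
    feats = {"token.lower": token.lower()}
    for n in range(1, min(max_len, len(token)) + 1):
        feats[f"prefix:{n}"] = token[:n].lower()
        feats[f"suffix:{n}"] = token[-n:].lower()
    return feats

token_features("Atorvastatin")  # includes e.g. {"suffix:2": "in", ...}
```

With enough training examples, the CRF can then learn that, say, a particular suffix is strongly associated with your entity type.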
We tried some variants of CNN and LSTM to do entity extraction at the character level, but found that it didn't work well. However, it's possible that this could work with much more tweaking. I wrote custom code in Keras, and I'd recommend going that route as well, unless a quick search for 'LSTM entity extraction' turns up something promising.
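Whatever character-level model you try, the first step is the same: converting entity spans into per-character labels. A small sketch of that data prep (BIO scheme, made-up example text):

```python
def char_bio_labels(text, spans):
    """Turn (start, end) entity spans into per-character B/I/O labels,
    the target format a character-level tagger would train on."""
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

char_bio_labels("take aspirin", [(5, 12)])
# ['O','O','O','O','O','B','I','I','I','I','I','I']
```

From there, the characters get mapped to integer ids and fed to whatever sequence model (LSTM, CNN) you're experimenting with.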
We also tried NCRF++, which is basically a hybrid of word-level CRF and character-level RNN models. You can find it here: https://github.com/jiesutd/NCRFpp (it includes character LSTM/CNN, word LSTM/CNN, and softmax/CRF components). While this worked alright, it had some disadvantages compared to crf_suite in that it was slower to train and more data-hungry. Worth a shot.
As I don't work at Rasa anymore, if you want more info feel free to contact me directly or open an issue on rasa_nlu. @amn41, thoughts on this? My GitHub profile is twhughes (https://github.com/twhughes), where you can find more contact info.
Hope this helps a bit.