sklearn-crfsuite performs very well but trains very slowly on large datasets. For example, on a dataset with ~200k examples and
max_iterations=500 (which seems to give the best performance), training takes more than 4 hours because
sklearn-crfsuite does not parallelize training. I'm training the model on an AWS EC2 instance with 8 CPUs, and CPU usage stays at 12.5% (about one core's worth) for the entire entity training process.
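As a quick sanity check on that number: 12.5% is exactly what you would expect if one of the 8 cores is fully busy and the other seven are idle. The helper name below is just something I made up for illustration; it is not part of sklearn-crfsuite or RASA.

```python
def cpu_share_percent(total_cpus: int, busy_cpus: int = 1) -> float:
    """Percent of total CPU capacity in use when `busy_cpus` cores are fully busy.

    Hypothetical helper for illustration only.
    """
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    return 100.0 * busy_cpus / total_cpus


# One fully busy core on an 8-CPU machine.
print(cpu_share_percent(8))  # 12.5
```

So the observed usage is consistent with single-threaded training, although I have not profiled it beyond watching overall CPU usage.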
It would be ideal if RASA offered an entity tagger that could be sped up with GPUs or parallelization. There are several options here, including BiLSTM networks.
I’m definitely willing to help with this project should other people feel the same way.