Entity tagging for large datasets

sklearn-crfsuite performs very well but trains very slowly on large datasets. For example, on a dataset with ~200k examples and max_iterations=500 (which seems to give the best performance), training takes more than 4 hours because sklearn-crfsuite does not parallelize training. I’m using an 8-CPU EC2 instance on AWS to train the model, and CPU usage stays at 12.5% for the entire entity training process.
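The 12.5% figure is what you would expect if a single-threaded trainer fully occupies one of the eight cores. A quick check of that arithmetic (the helper name is made up for illustration):

```python
def single_core_share(num_cpus: int) -> float:
    """Percent of total CPU capacity in use when exactly one core is fully busy."""
    return 100.0 / num_cpus


# One busy core out of eight, as on the 8-CPU instance described above.
print(single_core_share(8))  # 12.5
```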

It would be ideal if Rasa offered an entity tagger that could be sped up with GPUs or parallelization. There are several options here, including BiLSTM networks.

I’m definitely willing to help with this project should other people feel the same way.

have you tried the tensorflow embedding pipeline?

@akelad I always read about NER in the tensorflow embedding context. He is saying that NER_CRF might be slow. What does the tensorflow embedding pipeline have to do with this? I know that this pipeline is now independent of spacy, but how does that relate to the independent CRF pipeline?

Yes, I’m aware of what he’s asking. When using the spacy pipeline, you may be using spacy features in ner_crf. Also, the problem is probably not related to entity extraction but to intent classification.

I tried the pre-made “tensorflow embedding pipeline” template on a small dataset and found that it did not perform well on entity extraction. It doesn’t seem to have much to do with entity extraction.

I want to use TensorFlow for intent classification, as I noticed that it performed well even with small training sets. However, I am still undecided about which NER methods/libraries I should use.

The current pipeline I am working with is the following:

language: "it"
pipeline:
  - name: "nlp_spacy"
  - name: "tokenizer_spacy"
  - name: "ner_crf"
  - name: "tokenizer_whitespace"
  - name: "intent_featurizer_count_vectors"
  - name: "intent_classifier_tensorflow_embedding"
    intent_tokenization_flag: true
    intent_split_symbol: "+"

My team would prefer to avoid using Spacy if possible; however, it seems to perform well on entity extraction. Do you have any suggestions?


Yeah, it would only be good for entities that are spelled in a similar way, since you use only the word itself or n-grams as features… I don’t know whether I should train the same sentences with different entities or different sentences with one or a few entities. Which way is better? Or should I train every sentence type with a group of entities? I guess this would be better.

How many different entity values do I need per sentence structure, as a rule of thumb? @akelad

@JoeTorino i don’t think entity extraction with spacy is going to work better than the non spacy version with italian :stuck_out_tongue: how much training data do you have so far? also have you run the evaluate script using the two different pipelines?

@datistiquo it doesn’t just pay attention to the word itself, it also looks at the surrounding words. you should do both, train different sentences with different entities. as for a rule of thumb, i can’t really give you answer for that as it depends what your training data looks like
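To make the "surrounding words" point concrete, here is a minimal sketch of window-based token features in the style of the sklearn-crfsuite tutorial. It is not Rasa's actual ner_crf code; the function names and the exact feature set are assumptions chosen for illustration.

```python
def token_features(tokens, i):
    """Features for tokens[i], including its left and right neighbours."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "low": word.lower(),
        "title": word.istitle(),
        "upper": word.isupper(),
        "digit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        prev = tokens[i - 1]
        feats["-1:low"] = prev.lower()
        feats["-1:title"] = prev.istitle()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        nxt = tokens[i + 1]
        feats["+1:low"] = nxt.lower()
        feats["+1:title"] = nxt.istitle()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens):
    """One feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# The same entity value in two sentence structures gets different context features.
a = sentence_features("book a flight to Rome".split())
b = sentence_features("book a train from Rome".split())
print(a[4]["-1:low"], b[4]["-1:low"])  # to from
```

Because the neighbouring words are part of each token's features, varying both the sentence structure and the entity values in the training data gives the tagger more context patterns to learn from.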

I think you can tell but I’m relatively new to Rasa. We don’t really have a lot of data to work with yet which is rather annoying as I am aware it makes it harder to compare the different pipelines since it doesn’t really leave much room to add different intents or entities.

I have been using Spacy in my pipeline so far; however, I might change it in the future. From reading around, I have heard that Duckling is quite good at entity extraction, but I’m not sure if it is available for Italian.

I really wouldn’t recommend using spacy for italian, because the model spacy provides doesn’t have any pretrained word vectors in it. So I’d suggest adding some more training data and switching to tensorflow asap

Do you recommend that I switch to TensorFlow also for Entity Recognition?

When I use your pre-configured template it works well with intent recognition but from what I saw it doesn’t have any entity recognition components.

So the ner_crf works independently of tensorflow and spacy. The only time you need spacy for ner_crf is if you use the pos feature. The preconfigured tensorflow embedding pipeline contains ner_crf by default; see “Choosing a Rasa NLU Pipeline”.
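As a rough way to check whether a given ner_crf feature configuration depends on spacy, the sketch below scans the feature lists for POS entries. It assumes the features option is a list of three lists (word before, current word, word after) and that the POS features are named "pos" and "pos2"; verify both against the documentation for your Rasa NLU version. The helper name is hypothetical.

```python
# Assumed names of the spacy-dependent ner_crf features.
POS_FEATURES = {"pos", "pos2"}


def needs_spacy(features):
    """Return True if any window position requests a POS feature."""
    return any(name in POS_FEATURES for window in features for name in window)


# Assumed layout: [features for previous word, current word, next word].
without_pos = [
    ["low", "title", "upper"],
    ["bias", "low", "suffix3", "upper", "title", "digit"],
    ["low", "title", "upper"],
]
with_pos = [
    ["low", "title", "upper", "pos", "pos2"],
    ["bias", "low", "suffix3", "upper", "title", "digit", "pos", "pos2"],
    ["low", "title", "upper", "pos", "pos2"],
]

print(needs_spacy(without_pos), needs_spacy(with_pos))  # False True
```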

I’d like to clarify some things in this convo as i have much the same question as OP.

My understanding is that currently the tensorflow embedding only concerns intent classification and that it has naught to do with entity extraction and ner_crf stuff. In my case I have a large dataset (~80k) because I have difficult entities to tag, while the intent is easy, so if my previous statement is true then GPU gains for TF won’t help at all.

Is this correct? Is there any way to speed up the ner_crf part?

EDIT: Also, stupid question: when the SpacyNLP component is training in the NLU training part, what is actually occurring? For example, when all i do is change from using sm to web_lg spacy model the entities i get from the crf are better, why is this? The only ones NLU returns are from CRF not spacy, as far as i can tell

EDIT 2: Ohh okay i think i get it - the crf uses POS stuff from spacy?