Entity tagging for large datasets

k-koehler · September 14, 2018, 7:13pm

sklearn-crfsuite performs very well but trains very slowly on large datasets. For example, on a dataset with ~200k examples and max_iterations=500 (which seems to give the best performance), training takes more than 4 hours because sklearn-crfsuite is not parallelized. I’m training the model on an AWS EC2 instance with 8 CPUs, and I see 12.5% CPU usage for the entire entity training process.
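For what it’s worth, 12.5% is exactly one core out of eight, which is what a single-threaded trainer would show. A quick sanity check, assuming the process keeps one core fully busy and the others idle (the function name is only illustrative):

```python
def single_thread_cpu_share(n_cpus: int) -> float:
    """Percent of total CPU that one fully busy thread can account for."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return 100.0 / n_cpus


# An 8-CPU machine with one saturated core reports 12.5% overall usage.
print(single_thread_cpu_share(8))  # 12.5
```

So the low overall usage does not mean the trainer is idle; under this assumption it means adding cores will not shorten training.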

It would be ideal if Rasa offered an entity tagger that could be sped up with GPUs or parallelization. There are several options here, including BiLSTM networks.

I’m definitely willing to help with this project should other people feel the same way.

akelad · September 17, 2018, 11:19am

have you tried the tensorflow embedding pipeline?

datistiquo · September 17, 2018, 11:23am

@akelad I always read about NER in the context of the tensorflow embedding pipeline. He is saying that ner_crf is slow. What does the tensorflow embedding pipeline have to do with this? I know that this pipeline is now independent of spaCy, but how is it related to the separate CRF pipeline?

akelad · September 18, 2018, 3:54pm

Yes, I’m aware of what he’s asking. When using the spaCy pipeline you may be using spaCy features in ner_crf. Also, the problem is probably not related to entity extraction, but to intent classification.

JoeTorino · September 20, 2018, 12:06pm

I tried using the pre-made “tensorflow embedding pipeline” template on a small dataset and found that it did not perform well on entity extraction. It doesn’t seem to include much that is related to entity extraction.

I want to use TensorFlow for intent classification, as I noticed that it performed well even with small training sets. However, I am still undecided as to which NER methods/libraries I should use.

The current pipeline I am working with is the following:

language: "it"
pipeline:
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "ner_crf"
- name: "tokenizer_whitespace"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"

My team would prefer to avoid using spaCy if possible; however, it seems to perform well on entity extraction. Do you have any suggestions?

datistiquo · September 20, 2018, 1:29pm

Yeah, it would only be good for entities that are spelled in a similar way, since you use only the word itself or n-grams as features… I don’t know whether I should train the same sentences with different entities, or different sentences with one or a few entities. Which way is better? Or should I train every sentence type with a group of entities? I guess this would be better.

How many different entity values do I need per sentence structure, as a rule of thumb? @akelad

akelad · September 21, 2018, 1:31pm

@JoeTorino I don’t think entity extraction with spaCy is going to work better than the non-spaCy version for Italian. How much training data do you have so far? Also, have you run the evaluate script using the two different pipelines?

@datistiquo It doesn’t just pay attention to the word itself, it also looks at the surrounding words. You should do both: train different sentences with different entities. As for a rule of thumb, I can’t really give you an answer for that, as it depends on what your training data looks like.
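To illustrate the point about surrounding words, here is a minimal sketch of how CRF token features are commonly built with a one-token context window. The feature names and the example sentence are assumptions for illustration, not the exact feature set ner_crf uses:

```python
def token_features(tokens, i):
    """Features for tokens[i], including its left and right neighbours."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "low": word.lower(),
        "title": word.istitle(),
        "digit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        prev = tokens[i - 1]
        features["-1:low"] = prev.lower()
        features["-1:title"] = prev.istitle()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        nxt = tokens[i + 1]
        features["+1:low"] = nxt.lower()
        features["+1:title"] = nxt.istitle()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


feats = sentence_features("book a table in Torino".split())
print(feats[4]["-1:low"])  # in
```

Because the previous word ("in") is part of the features for "Torino", the model can learn from the sentence pattern and not only from the spelling of the entity value, which is why varying both sentences and entity values in the training data helps.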

JoeTorino · September 24, 2018, 12:19pm

I think you can tell, but I’m relatively new to Rasa. We don’t really have a lot of data to work with yet, which is rather annoying: I am aware it makes it harder to compare the different pipelines, since it doesn’t leave much room to add different intents or entities.

I have been using spaCy in my pipeline so far, but I might change that in the future. From reading around I have heard that Duckling is quite good at entity extraction; however, I’m not sure whether it is available for Italian.

akelad · September 26, 2018, 9:59am

I really wouldn’t recommend using spaCy for Italian, because the model spaCy provides doesn’t have any pretrained word vectors in it. So I’d suggest adding some more training data and switching to TensorFlow asap.

JoeTorino · October 1, 2018, 9:50am

Do you recommend that I switch to TensorFlow also for Entity Recognition?

When I use your pre-configured template it works well with intent recognition but from what I saw it doesn’t have any entity recognition components.

akelad · October 2, 2018, 8:36am

So ner_crf works independently of TensorFlow and spaCy. The only time you need spaCy for ner_crf is if you use the pos feature. The preconfigured tensorflow embedding pipeline contains ner_crf by default (see: Choosing a Rasa NLU Pipeline).
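A rough way to check this for a given configuration is sketched below. It assumes a ner_crf-style feature layout of three lists (previous token, current token, next token) and assumes that "pos" and "pos2" are the only feature names that require spaCy; the helper name and example lists are hypothetical:

```python
# Assumption: these are the only feature names that need spaCy POS tags.
SPACY_ONLY_FEATURES = {"pos", "pos2"}


def needs_spacy(features):
    """True if any window (previous/current/next) requests a POS feature."""
    return any(
        name in SPACY_ONLY_FEATURES
        for window in features
        for name in window
    )


# Hypothetical feature layouts: [previous token, current token, next token].
without_pos = [["low", "title"], ["bias", "low", "suffix3", "digit"], ["low", "title"]]
with_pos = [["low", "title"], ["bias", "low", "pos", "pos2"], ["low", "title"]]

print(needs_spacy(without_pos), needs_spacy(with_pos))  # False True
```

If the check comes back False for your feature lists, the spaCy components should not be needed for entity extraction under these assumptions.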

Zylatis · May 17, 2019, 6:31am

I’d like to clarify some things in this conversation, as I have much the same question as the OP.

My understanding is that the tensorflow embedding currently only concerns intent classification and has nothing to do with entity extraction or ner_crf. In my case I have a large dataset (~80k) because my entities are difficult to tag while the intent is easy, so if my previous statement is true, GPU gains for TF won’t help at all.

Is this correct? Is there any way to speed up the ner_crf part?

EDIT: Also, a stupid question: when the SpacyNLP component is training as part of NLU training, what is actually occurring? For example, when all I do is switch from the sm to the web_lg spaCy model, the entities I get from the CRF are better. Why is this? As far as I can tell, the only entities NLU returns are from the CRF, not spaCy.

EDIT 2: Ohh okay, I think I get it: the CRF uses POS features from spaCy?
