CRFEntityExtractor Pipeline

Hello everyone, I tried to train my model with the CRFEntityExtractor to extract some custom entities using a custom whitespace tokenizer, but it gives the error below:
'CRFEntityExtractor' requires ['Tokenizer']

And also when I try to use the normal WhitespaceTokenizer it still throws the same error. Should I use a specific tokenizer, or is the pipeline wrong? Can you please help? Here is my pipeline:

language: "en"

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "CRFEntityExtractor"
  - name: "ner_synonyms"
  - name: "CountVectorsFeaturizer"
    "OOV_token": "oov"
  - name: "intent_featurizer_count_vectors"
  - name: "intent_classifier_tensorflow_embedding"
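
For completeness, a custom component like my tokenizer is normally listed in the pipeline by its module path rather than by class name alone, along these lines (my_tokenizers is just a placeholder for the actual module):

pipeline:
  - name: "my_tokenizers.WhitespaceTokenizer_ar_cdg"
  - name: "CRFEntityExtractor"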

What is in your custom whitespace tokenizer?

Hello, the custom whitespace tokenizer is working great. It looks like this:

# Import paths below assume Rasa 1.x; adjust them to your version.
from itertools import chain
from typing import Any, Dict, Text

from rasa.nlu.components import Component
from rasa.nlu.constants import MESSAGE_ATTRIBUTES, TOKENS_NAMES
from rasa.nlu.tokenizers.tokenizer import Tokenizer


class WhitespaceTokenizer_ar_cdg(Tokenizer, Component):

    provides = [TOKENS_NAMES[attribute] for attribute in MESSAGE_ATTRIBUTES]

    @staticmethod
    def unique_words(lines):
        # helper: set of unique words across the non-empty training lines
        return set(chain(*(line.split() for line in lines if line)))

    # note: this class attribute shadows the built-in name `dict`
    dict = {}

    defaults = {
        # text will be tokenized case-sensitively by default
        "case_sensitive": True
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the WhitespaceTokenizer framework."""
        super(WhitespaceTokenizer_ar_cdg, self).__init__(component_config)
        self.case_sensitive = self.component_config["case_sensitive"]

It is the same as the original WhitespaceTokenizer, I just added some code. That's not the issue, because it works fine if I remove the CRFEntityExtractor. Once I add that specific extractor it throws the error, and even when I use just the normal WhitespaceTokenizer it doesn't work; it still gives me the error 'CRFEntityExtractor' requires ['Tokenizer'] as I showed above.

I am not really sure, but are you inheriting the Component class in your Tokenizer?

Could you simply inherit:

class WhitespaceTokenizer_ar_cdg(Tokenizer):

because the Tokenizer class already inherits from Component.
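
If it helps, here is a minimal sketch of what that could look like, assuming the Rasa 1.x Tokenizer API (import paths and the tokenize signature may differ between versions), with the whitespace logic reduced to the basics:

from typing import Any, Dict, List, Text

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class WhitespaceTokenizer_ar_cdg(Tokenizer):

    defaults = {"case_sensitive": True}

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)
        self.case_sensitive = self.component_config["case_sensitive"]

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)
        if not self.case_sensitive:
            text = text.lower()

        # split on whitespace and keep the character offset of each token
        tokens = []
        offset = 0
        for word in text.split():
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens

The defaults and __init__ parts stay the same as in your snippet; only the base class and the tokenize override change.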

It is possible that the CRF extractor is not receiving the correct class type. This is what it requires:

@classmethod
def required_components(cls) -> List[Type[Component]]:
    return [Tokenizer]
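
I am not sure how the check is implemented in your exact Rasa version, but conceptually the pipeline validation has to match that requirement against the classes of the components placed earlier in the pipeline. A simplified, purely illustrative sketch (not the actual Rasa source; import path assumes Rasa 1.x) would be:

from typing import List

from rasa.nlu.components import Component


# Illustrative only: each class returned by required_components() must be
# matched by some component instance that appears earlier in the pipeline.
def check_pipeline_requirements(pipeline: List[Component]) -> List[str]:
    problems = []
    for i, component in enumerate(pipeline):
        for required in component.required_components():
            if not any(isinstance(previous, required) for previous in pipeline[:i]):
                problems.append(
                    f"'{type(component).__name__}' requires ['{required.__name__}']"
                )
    return problems

So if the class that ends up before CRFEntityExtractor does not satisfy that kind of check, this is exactly the error you would see, even though a tokenizer is listed first.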