CRFEntityExtractor Pipeline

Hello everyone, I tried to train my model with the CRFEntityExtractor to extract some custom entities using a custom whitespace tokenizer, but it gives the error below:
'CRFEntityExtractor' requires ['Tokenizer']

And also when I try to use the normal WhitespaceTokenizer it still throws the same error. Should I use a specific tokenizer, or is the pipeline wrong? Can you please help? Here is my pipeline:

language: "en"

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "CRFEntityExtractor"
  - name: "ner_synonyms"
  - name: "CountVectorsFeaturizer"
    "OOV_token": "oov"
  - name: "intent_featurizer_count_vectors"
  - name: "intent_classifier_tensorflow_embedding"
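
For completeness, a custom component like my tokenizer is normally listed in the pipeline by its module path rather than by class name alone, along these lines (my_tokenizers is just a placeholder for the actual module):

pipeline:
  - name: "my_tokenizers.WhitespaceTokenizer_ar_cdg"
  - name: "CRFEntityExtractor"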

What is in your custom whitespace tokenizer?

Hello, the custom whitespace tokenizer is working great. It looks like this:

# Import paths below assume Rasa 1.x; adjust them to your version.
from itertools import chain
from typing import Any, Dict, Text

from rasa.nlu.components import Component
from rasa.nlu.constants import MESSAGE_ATTRIBUTES, TOKENS_NAMES
from rasa.nlu.tokenizers.tokenizer import Tokenizer


class WhitespaceTokenizer_ar_cdg(Tokenizer, Component):

    provides = [TOKENS_NAMES[attribute] for attribute in MESSAGE_ATTRIBUTES]

    @staticmethod
    def unique_words(lines):
        # helper: set of unique words across the non-empty training lines
        return set(chain(*(line.split() for line in lines if line)))

    # note: this class attribute shadows the built-in name `dict`
    dict = {}

    defaults = {
        # text will be tokenized case-sensitively by default
        "case_sensitive": True
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the WhitespaceTokenizer framework."""
        super(WhitespaceTokenizer_ar_cdg, self).__init__(component_config)
        self.case_sensitive = self.component_config["case_sensitive"]

It is the same as the original WhitespaceTokenizer, I just added some code. That's not the issue, because it works fine if I remove the CRFEntityExtractor. Once I add that specific extractor it throws the error, and even when I use just the normal WhitespaceTokenizer it doesn't work; it still gives me the error 'CRFEntityExtractor' requires ['Tokenizer'] as I showed above.

I am not really sure, but are you inheriting the Component class in your Tokenizer?

Could you simply inherit:

class WhitespaceTokenizer_ar_cdg(Tokenizer):

because the Tokenizer class already inherits from Component.
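
If it helps, here is a minimal sketch of what that could look like, assuming the Rasa 1.x Tokenizer API (import paths and the tokenize signature may differ between versions), with the whitespace logic reduced to the basics:

from typing import Any, Dict, List, Text

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class WhitespaceTokenizer_ar_cdg(Tokenizer):

    defaults = {"case_sensitive": True}

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)
        self.case_sensitive = self.component_config["case_sensitive"]

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)
        if not self.case_sensitive:
            text = text.lower()

        # split on whitespace and keep the character offset of each token
        tokens = []
        offset = 0
        for word in text.split():
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens

The defaults and __init__ parts stay the same as in your snippet; only the base class and the tokenize override change.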

It is possible that the CRF extractor is not receiving the correct class type. This is what it requires:

@classmethod
def required_components(cls) -> List[Type[Component]]:
    return [Tokenizer]
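
I am not sure how the check is implemented in your exact Rasa version, but conceptually the pipeline validation has to match that requirement against the classes of the components placed earlier in the pipeline. A simplified, purely illustrative sketch (not the actual Rasa source; import path assumes Rasa 1.x) would be:

from typing import List

from rasa.nlu.components import Component


# Illustrative only: each class returned by required_components() must be
# matched by some component instance that appears earlier in the pipeline.
def check_pipeline_requirements(pipeline: List[Component]) -> List[str]:
    problems = []
    for i, component in enumerate(pipeline):
        for required in component.required_components():
            if not any(isinstance(previous, required) for previous in pipeline[:i]):
                problems.append(
                    f"'{type(component).__name__}' requires ['{required.__name__}']"
                )
    return problems

So if the class that ends up before CRFEntityExtractor does not satisfy that kind of check, this is exactly the error you would see, even though a tokenizer is listed first.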