Specify component input in RASA NLU

Hi everyone,

Let’s suppose I use the CRFEntityExtractor and the SpacyEntityExtractor at the same time in my pipeline. From what I’ve read, it’s better to use the first one with WhitespaceTokenizer and the second with SpacyTokenizer.

If I put all these components in my pipeline, can I specify which NER uses which tokenizer?

Thank you,


Sure, you can, although you need to override quite a few defaults.
If you look at the source code, both tokenizers (WhitespaceTokenizer and SpacyTokenizer) provide the same property, "tokens", so the extractors cannot tell which tokenizer produced it.

You need to change this so that each tokenizer provides something different, like whitespace_tokens and spacy_tokens.
Then you need to change each NER component to include logic that selects which tokens to use. You may also need to change the configuration of components further down the pipeline if any of them require tokens.
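To see why the renaming affects the rest of the pipeline: as far as I understand, each component's requires entries have to be satisfied by the provides of a component that runs earlier. The toy check below is independent of Rasa. The Comp class, the missing_requirements helper and the SomeFeaturizer entry are made up for illustration and are not Rasa APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comp:
    """Hypothetical stand-in for a pipeline component (not a Rasa class)."""
    name: str
    provides: List[str] = field(default_factory=list)
    requires: List[str] = field(default_factory=list)


def missing_requirements(pipeline: List[Comp]) -> List[str]:
    """List every 'component: requirement' pair that no earlier
    component in the pipeline provides."""
    provided = set()
    problems = []
    for comp in pipeline:
        for req in comp.requires:
            if req not in provided:
                problems.append(f"{comp.name}: {req}")
        provided.update(comp.provides)
    return problems


pipeline = [
    Comp("SpacyNLP", provides=["spacy_doc", "spacy_nlp"]),
    Comp("WhitespaceTokenizer", provides=["whitespace_tokens"]),
    Comp("SpacyTokenizer", provides=["spacy_tokens"], requires=["spacy_doc"]),
    Comp("CRFEntityExtractor", provides=["entities"],
         requires=["whitespace_tokens"]),
    Comp("SpacyEntityExtractor", provides=["entities"],
         requires=["spacy_tokens", "spacy_nlp"]),
    # A downstream component that still asks for the old name is flagged.
    Comp("SomeFeaturizer", provides=["text_features"], requires=["tokens"]),
]

print(missing_requirements(pipeline))  # ['SomeFeaturizer: tokens']
```

Any component reported here would need its requires list updated to one of the new token names.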

The code would look something like:

# whitespace_tokenizer
class WhitespaceTokenizer(Tokenizer, Component):
    provides = ["whitespace_tokens"]

# spacy_tokenizer
class SpacyTokenizer(Tokenizer, Component):
    provides = ["spacy_tokens"]
    requires = ["spacy_doc"]

# crf_entity_extractor
class CRFEntityExtractor(EntityExtractor):
    provides = ["entities"]
    requires = ["whitespace_tokens"]

# spacy_entity_extractor
class SpacyEntityExtractor(EntityExtractor):
    provides = ["entities"]
    requires = ["spacy_tokens", "spacy_nlp"]

# If you want some fine control then you may need to do it the following way:
# custom_entity_extractor
class CustomEntityExtractor(EntityExtractor):
    provides = ["entities"]
    requires = ["whitespace_tokens", "spacy_tokens"]

    def __init__(self, component_config=None):
        super(CustomEntityExtractor, self).__init__(component_config)
        self.crf = CRFEntityExtractor(component_config)
        self.spacy_ner = SpacyEntityExtractor(component_config)

    # You need to override the train and process methods with your logic
    # about which to use
    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        # custom logic to select which entity extractor to use
        # the spaCy extractor cannot be trained, so only the CRF is trained here
        self.crf.train(training_data, config, **kwargs)

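    def process(self, message: Message, **kwargs: Any) -> None:
        # some logic to select which extractor to use
        # collect the entities it extracts in `extracted`
        extracted = []  # placeholder until the selection logic is written
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )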

You’ll need to figure out the exact details on how to implement this. (I haven’t done it yet).
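
For the "logic to select which to use" part, one possible rule is to prefer one extractor and only keep the other's entities where they don't overlap. This is just a sketch of that idea, not something I've run inside Rasa. It assumes entities are plain dicts with "start", "end", "value" and "entity" keys, and the merge_entities and overlaps helpers are names I made up.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the character spans [start, end) of a and b intersect."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_entities(preferred: List[Entity], fallback: List[Entity]) -> List[Entity]:
    """Keep every preferred entity, plus the fallback entities that do
    not overlap any preferred one, ordered by start position."""
    merged = list(preferred)
    for ent in fallback:
        if not any(overlaps(ent, kept) for kept in preferred):
            merged.append(ent)
    return sorted(merged, key=lambda e: e["start"])


# Example text: "Fly me to Berlin by tomorrow"
crf_entities = [
    {"start": 10, "end": 16, "value": "Berlin", "entity": "city"},
]
spacy_entities = [
    {"start": 10, "end": 16, "value": "Berlin", "entity": "GPE"},
    {"start": 20, "end": 28, "value": "tomorrow", "entity": "DATE"},
]

merged = merge_entities(crf_entities, spacy_entities)
print([e["entity"] for e in merged])  # ['city', 'DATE']
```

Here the CRF result wins for "Berlin" and the spaCy result is kept for "tomorrow". Inside process, the merged list would take the place of extracted.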

Hope that helps.