I am trying to create a custom component for NLU. I want to preprocess the user input, for example with lemmatization, before anything else in the NLU pipeline runs. I took the code from the Rasa GitHub repository and modified it for my requirements. When I print the input inside the component, I can see it being modified. However, the modified text is still not used for intent and entity recognition. Here is my code:

Hi @keerti, welcome to the Rasa forum! It’s cool that you’re building a custom component. Are you adding your custom component to your pipeline? If so, what does it look like?

Hi @mloubser, I have added the custom component in the pipeline this way:-

 pipeline:
    - name: "Lemmatizationtokenizer"
    - name: WhitespaceTokenizer
    - name: RegexFeaturizer
    - name: CRFEntityExtractor
    - name: EntitySynonymMapper
    - name: CountVectorsFeaturizer
    - name: EmbeddingIntentClassifier

Thanks. What version of Rasa are you on? Looks like something pre-1.8, right?

I am using rasa version:- 1.6.1

Ok, thanks. It looks like you’re using two constant names that aren’t available: TEXT_ATTRIBUTE and INTENT_ATTRIBUTE. I think you want TEXT and INTENT respectively.

*Edit: Sorry, this is true for higher versions of rasa, not the one you’re on.

I’m not sure what code you pulled to start this - could you link to the file on Github that you were looking at?

Another note - doesn’t look like you’re using the stop words you are loading anywhere?

Sorry, I edited my comment on that. If you upgrade to a higher version of Rasa, that will be true; in Rasa 1.6.1, you have the right names.

re. stop words - I see that you’re loading them, but where are they being used?

What happens when you run rasa train nlu? Does it train successfully?

If yes, what happens with rasa shell nlu and then entering something like test message-things-to ? remove .. ?

Sorry, my mistake. I removed the stop-words logic before submitting this question here. I trained the model the same way you mentioned above, with “rasa train nlu”, and it trained successfully.

I ran it using “rasa shell --debug”.

Suppose I enter input like “flights timings to India?”. I am using a print statement in the custom component, and there I get “flight timing to india”.

But for intent and entity recognition, my log shows: "rasa.core.processor - Received user message ‘flight timings to india?’ with intent …"

Ok, I see what’s happening here. In these two functions in rasa.nlu.training_data.message, you see how the text attribute is handled differently than the other attributes? so message.data["text"] will be what you’re expecting, but message.text will be left untouched.

    def set(self, prop, info, add_to_output=False) -> None:
        self.data[prop] = info
        if add_to_output:
            self.output_properties.add(prop)

    def get(self, prop, default=None) -> Any:
        if prop == TEXT_ATTRIBUTE:
            return self.text
        return self.data.get(prop, default)

You can get around this by doing:

    def process(self, message: Message, **kwargs: Any) -> None:
        message.text = self.tokenize(message.text)
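
To make the difference concrete, here is a minimal sketch using a stand-in class that copies only the two methods quoted above. FakeMessage is hypothetical and is not Rasa's Message class, and the constant value "text" is an assumption based on the message.data["text"] remark.

```python
from typing import Any, Dict, Set

TEXT_ATTRIBUTE = "text"  # assumed value of the Rasa 1.6 constant


class FakeMessage:
    """Hypothetical stand-in that mirrors only the quoted set/get logic."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data: Dict[str, Any] = {}
        self.output_properties: Set[str] = set()

    def set(self, prop: str, info: Any, add_to_output: bool = False) -> None:
        self.data[prop] = info
        if add_to_output:
            self.output_properties.add(prop)

    def get(self, prop: str, default: Any = None) -> Any:
        if prop == TEXT_ATTRIBUTE:
            return self.text
        return self.data.get(prop, default)


message = FakeMessage("flights timings to India?")

# Writing through set() only updates the data dict, so get() is unaffected.
message.set(TEXT_ATTRIBUTE, "flight timing to india?")
via_set = message.get(TEXT_ATTRIBUTE)

# Assigning the attribute directly changes what get() returns.
message.text = "flight timing to india?"
via_attribute = message.get(TEXT_ATTRIBUTE)
```

In this sketch via_set is still the original text, while via_attribute holds the lemmatized text, which is why assigning message.text works where message.set does not.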

However, since you’re creating a tokenizer, it would make sense for your class to return tokens, not text, which would be handled correctly. Using WhitespaceTokenizer over pre-tokenized text wouldn’t really make sense, in my opinion.

For example, I passed the sentence “Are you a bot?” to the model, and Lemmatizationtokenizer returns

be -PRON- a bot?

But after WhitespaceTokenizer is done with it, you get

be   pron   a bot?
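
If the -PRON- placeholder is unwanted, one option is to fall back to the token's own text whenever spaCy returns that placeholder as the lemma. This is only a sketch under assumptions: join_lemmas is a hypothetical helper (not part of Rasa or spaCy), and the (text, lemma) pairs are assumed to come from token.text and token.lemma_.

```python
from typing import Iterable, Tuple

PRON_PLACEHOLDER = "-PRON-"  # lemma that spaCy 2.x assigns to pronouns


def join_lemmas(pairs: Iterable[Tuple[str, str]]) -> str:
    """Join lemmas, keeping the lowercased surface form for pronouns."""
    words = []
    for text, lemma in pairs:
        words.append(text.lower() if lemma == PRON_PLACEHOLDER else lemma)
    return " ".join(words)


# Hand-written pairs standing in for [(t.text, t.lemma_) for t in doc].
example = [("Are", "be"), ("you", "-PRON-"), ("a", "a"), ("bot", "bot")]
result = join_lemmas(example)
```

With this fallback, a later tokenizer never sees the dashes around PRON at all.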

I changed the code to this :

But I am still facing the same problem. And I am handling the -PRON- using stopwords in my code.

I don’t want to create a tokenizer. I just want to take the input, lemmatize it, and forward the result to the rest of the NLU pipeline.

Ok, if you don’t want to tokenize then it makes sense to edit the text, I was going off of the class name.

Did you remove everything from your tokenize function? With the code below (basically the same as your original, except the train and process functions), I got the results that follow:

config.yml

language: en
pipeline: 
  - name: Lemmatizer.Lemmatizer
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
  - name: EntitySynonymMapper
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier

Lemmatizer.py

import os
from typing import Any, Dict, List, Optional, Text, Type

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.constants import TEXT_ATTRIBUTE
from rasa.nlu.training_data import Message, TrainingData

# Restrict CUDA to the first GPU; set before spaCy is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import spacy

nlp = spacy.load("en")


class Lemmatizer(Component):
    """Lemmatize the message text before the rest of the NLU pipeline runs."""

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        """Specify which components need to be present in the pipeline."""
        return []

    # language_list = ["en", "fr"]
    language_list = None
    defaults = {}
    provides = ["text"]
    name = "Lemmatizationtokenizer"

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        for example in training_data.training_examples:
            lemmatized = self.tokenize(example.text)
            # get(TEXT_ATTRIBUTE) returns example.text, so assign the attribute too.
            example.text = lemmatized
            example.set(TEXT_ATTRIBUTE, lemmatized)

    def process(self, message: Message, **kwargs: Any) -> None:
        # Lemmatize once and store the same result in both places.
        lemmatized = self.tokenize(message.text)
        message.text = lemmatized
        message.set(TEXT_ATTRIBUTE, lemmatized)

    def tokenize(self, text: Text) -> Text:
        """Return the lowercased, lemmatized text (English, via spaCy)."""
        if text:
            text = text.replace("-", " ").lower()
            lemmas = [token.lemma_ for token in nlp(str(text))]
            text = " ".join(lemmas)
            # Re-attach "?" and "." that spaCy split off as separate tokens.
            text = text.replace(" ?", "?").replace(" .", ".")
            text = text.strip()
        return text

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        return cls(meta)

rasa shell nlu

NLU model loaded. Type a message and press enter to parse it.
Next message:
Flights to india
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "flight to india"
}
Next message:
are you a bot?
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "be -PRON- a bot?"
}
Next message:
I-have-dashes-everywhere
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "i have dash everywhere"
}

Ok, I think those two issues are unrelated to your lemmatizer. The first says you should check your annotation of custom entities so you don’t have e.g. This is an [entity ](ent) with trailing whitespace or this [entity, ](ent) includes punctuation.

The second warning says you should check that you haven’t defined two values for a single entity, e.g.

- I work for [Rasa](company:rasa)
- I work for [Rasa](company:RasaTechnologies)

Those are considered conflicting synonym definitions.
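
A rough way to spot both problems before training is to scan the Markdown training examples for [text](entity:value) annotations. The sketch below is only an illustration under assumptions: find_annotation_issues is a hypothetical helper, not a Rasa function, and the regular expression covers only the simple annotation form shown above.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Matches [surface text](entity) or [surface text](entity:value).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^):]+)(?::([^)]+))?\)")


def find_annotation_issues(lines: List[str]) -> List[str]:
    """Report edge whitespace, trailing punctuation and conflicting synonyms."""
    issues: List[str] = []
    synonyms: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        for surface, _entity, value in ANNOTATION.findall(line):
            stripped = surface.strip()
            if surface != stripped:
                issues.append(f"whitespace at edge of annotation: {surface!r}")
            if stripped and stripped[-1] in ",.;:!?":
                issues.append(f"trailing punctuation in annotation: {surface!r}")
            if value:
                synonyms[stripped].add(value)
    for surface, values in synonyms.items():
        if len(values) > 1:
            issues.append(f"conflicting synonyms for {surface!r}: {sorted(values)}")
    return issues


sample = [
    "- This is an [entity ](ent) with trailing whitespace",
    "- this [entity, ](ent) includes punctuation",
    "- I work for [Rasa](company:rasa)",
    "- I work for [Rasa](company:RasaTechnologies)",
]
issues = find_annotation_issues(sample)
```

Running it over the training file line by line gives a list of messages to fix by hand; it does not modify the data.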