Hi @keerti, welcome to the Rasa forum! It’s cool that you’re building a custom component. Are you adding your custom component to your pipeline? If so, what does it look like?
Hi @mloubser, I have added the custom component to the pipeline this way:
pipeline:
- name: "Lemmatizationtokenizer"
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: EntitySynonymMapper
- name: CountVectorsFeaturizer
- name: EmbeddingIntentClassifier
Thanks. What version of Rasa are you on? It looks like something pre-1.8, right?
I am using Rasa version 1.6.1.
Ok, thanks. It looks like you’re using two constant names that aren’t available: TEXT_ATTRIBUTE and INTENT_ATTRIBUTE. I think you want TEXT and INTENT respectively.
*Edit: Sorry, this is true for higher versions of Rasa, not the one you’re on.
I’m not sure what code you pulled to start this - could you link to the file on Github that you were looking at?
Another note - it doesn’t look like you’re using the stop words you are loading anywhere?
Sorry, I edited my comment on that - if you upgrade to a higher version of Rasa, that will be true; in Rasa 1.6.1, you have the right names.
Re. stop words - I see that you’re loading them, but where are they being used?
What happens when you run rasa train nlu? Does it train successfully?
If yes, what happens with rasa shell nlu and then entering something like “test message-things-to ? remove .”?
Sorry, my mistake. I removed the stop-words logic before submitting this question here. I trained the model the same way you mentioned above (“rasa train nlu”) and it trained successfully.
I run it using “rasa shell --debug”.
Suppose I enter input like “flights timings to India?”. I am using a print statement in the custom component, and there I am getting “flight timing to india”.
But for intent and entity recognition, my log shows: "rasa.core.processor - Received user message ‘flight timings to india?’ with intent …"
Ok, I see what’s happening here. In these two functions in rasa.nlu.training_data.message, you see how the text attribute is handled differently than the other attributes? So message.data["text"] will be what you’re expecting, but message.text will be left untouched.
def set(self, prop, info, add_to_output=False) -> None:
    self.data[prop] = info
    if add_to_output:
        self.output_properties.add(prop)

def get(self, prop, default=None) -> Any:
    if prop == TEXT_ATTRIBUTE:
        return self.text
    return self.data.get(prop, default)
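Here’s a minimal, self-contained sketch of that asymmetry (this is a simplified stand-in for the real Message class, not Rasa’s actual implementation):

```python
from typing import Any

TEXT_ATTRIBUTE = "text"  # the constant's value in Rasa 1.6.x


class Message:
    """Simplified stand-in for rasa.nlu.training_data.Message."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data: dict = {}

    def set(self, prop: str, info: Any) -> None:
        # set() only ever writes to the data dict, never to self.text
        self.data[prop] = info

    def get(self, prop: str, default: Any = None) -> Any:
        # get() special-cases "text" and returns self.text instead of data["text"]
        if prop == TEXT_ATTRIBUTE:
            return self.text
        return self.data.get(prop, default)


msg = Message("flights timings to India?")
msg.set(TEXT_ATTRIBUTE, "flight timing to india")
print(msg.get(TEXT_ATTRIBUTE))   # "flights timings to India?" - the original, untouched
print(msg.data[TEXT_ATTRIBUTE])  # "flight timing to india" - the lemmatized copy
```

So a component that only calls set(TEXT_ATTRIBUTE, ...) changes what downstream featurizers see in data, but not the raw message.text that the processor logs.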
You can get around this by doing:
def process(self, message: Message, **kwargs: Any) -> None:
    message.text = self.tokenize(message.text)
However, since you’re creating a tokenizer, it would make sense for your class to return tokens rather than text, which would be handled correctly. Running WhitespaceTokenizer over pre-tokenized text wouldn’t really make sense, in my opinion.
For example, I passed the sentence “Are you a bot?” to the model, and Lemmatizationtokenizer returns
be -PRON- a bot?
But after WhitespaceTokenizer
is done with it, you get
be pron a bot?
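If you did want your component to behave as a true tokenizer, the usual pattern is to produce token objects with character offsets rather than a joined string. A rough sketch of that idea (using a simplified Token class, not the real rasa.nlu.tokenizers.Token):

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Simplified stand-in for Rasa's Token: the token text and its start offset."""
    text: str
    offset: int


def tokenize_with_offsets(text: str) -> List[Token]:
    # Split on whitespace while tracking each token's start position in the text
    tokens = []
    offset = 0
    for word in text.split():
        offset = text.index(word, offset)
        tokens.append(Token(word, offset))
        offset += len(word)
    return tokens


print(tokenize_with_offsets("be -PRON- a bot?"))
# [Token('be', 0), Token('-PRON-', 3), Token('a', 10), Token('bot?', 12)]
```

Because tokens carry offsets into the original text, entity extractors downstream can map predictions back to the raw message, which is exactly what gets lost when a “tokenizer” rewrites the text instead.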
I changed the code to this:
But I am still facing the same problem. And I am handling the -PRON- using stop words in my code.
I don’t want to create a Tokenizer. I just want to take the input, do lemmatization, and forward that input to the next component in the NLU pipeline.
Ok, if you don’t want to tokenize then it makes sense to edit the text; I was going off of the class name.
Did you remove everything from your tokenize function? With the code below (basically the same as your original, except for the train and process functions), I got the results that follow:
config.yml
language: en
pipeline:
- name: Lemmatizer.Lemmatizer
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: EntitySynonymMapper
- name: CountVectorsFeaturizer
- name: EmbeddingIntentClassifier
Lemmatizer.py
import os
from typing import Any, Optional, Text, Dict, List, Type

import spacy

from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.constants import TEXT_ATTRIBUTE
from rasa.nlu.components import Component

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

nlp = spacy.load("en")


class Lemmatizer(Component):

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        """Specify which components need to be present in the pipeline."""
        return []

    # language_list = ["en", "fr"]
    language_list = None
    defaults = {}
    provides = ["text"]
    name = "Lemmatizationtokenizer"

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        for example in training_data.training_examples:
            example.set(
                TEXT_ATTRIBUTE,
                self.tokenize(example.get(TEXT_ATTRIBUTE))
            )

    def process(self, message: Message, **kwargs: Any) -> None:
        # Lemmatize once, then write the result to both message.text and the data dict
        message.text = self.tokenize(message.text)
        message.set(TEXT_ATTRIBUTE, message.text)

    def tokenize(self, text: Text):
        # English lemmatization using spacy
        if text:
            text = text.replace("-", " ")
            text = text.lower()
            doc = nlp(str(text))
            lst = []
            for token in doc:
                lst.append(token.lemma_)
            text = ' '.join(lst)
            text = text.replace(" ?", "?").replace(" .", ".")
            text = text.strip()
        return text

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        else:
            return cls(meta)
rasa shell nlu
NLU model loaded. Type a message and press enter to parse it.
Next message:
Flights to india
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "flight to india"
}
Next message:
are you a bot?
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "be -PRON- a bot?"
}
Next message:
I-have-dashes-everywhere
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "i have dash everywhere"
}
Ok, I think those two issues are unrelated to your lemmatizer. The first warning says you should check your annotation of custom entities so that you don’t have e.g. This is an [entity ](ent) with trailing whitespace or this [entity, ](ent) includes punctuation.
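For comparison, a clean annotation keeps the brackets tight around the entity text, with no trailing whitespace or punctuation inside them:
- This is an [entity](ent) with clean brackets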
The second warning says you should check that you haven’t defined two values for a single entity, e.g.
- I work for [Rasa](company:rasa)
- I work for [Rasa](company:RasaTechnologies)
Those are considered conflicting synonym definitions.
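A consistent version maps every surface form of the entity to the same synonym value, for example:
- I work for [Rasa](company:rasa)
- I work for [Rasa Technologies](company:rasa)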