Hi @keerti, welcome to the Rasa forum! It’s cool that you’re building a custom component. Are you adding your custom component to your pipeline? If so, what does it look like?
Hi @mloubser, I have added the custom component to the pipeline this way:
pipeline:
- name: "Lemmatizationtokenizer"
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: EntitySynonymMapper
- name: CountVectorsFeaturizer
- name: EmbeddingIntentClassifier
Thanks. What version of Rasa are you on? It looks like something pre-1.8, right?
I am using Rasa version 1.6.1.
Ok, thanks. It looks like you’re using two constant names that aren’t available: TEXT_ATTRIBUTE and INTENT_ATTRIBUTE. I think you want TEXT and INTENT respectively.
*Edit: Sorry, this is true for higher versions of Rasa, not the one you’re on.
I’m not sure what code you pulled to start this - could you link to the file on Github that you were looking at?
Another note - it doesn’t look like you’re using the stop words you are loading anywhere?
Sorry, I edited my comment on that - if you upgrade to a higher version of Rasa, that will be true; in Rasa 1.6.1, you have the right names.
Re. stop words - I see that you’re loading them, but where are they being used?
What happens when you run rasa train nlu? Does it train successfully?
If yes, what happens with rasa shell nlu and then entering something like “test message-things-to ? remove .”?
Sorry, my mistake. I removed the stop-words logic before submitting this question here. I trained the model the same way you mentioned above (“rasa train nlu”) and it trained successfully.
I run it using “rasa shell --debug”.
Suppose I enter input like “flights timings to India?”. I am using a print statement in the custom component, and there I am getting “flight timing to india”.
But for intent and entity recognition, my log shows: "rasa.core.processor - Received user message ‘flight timings to india?’ with intent …"
Ok, I see what’s happening here. In these two functions in rasa.nlu.training_data.message, you see how the text attribute is handled differently than the other attributes? So message.data["text"] will be what you’re expecting, but message.text will be left untouched.
def set(self, prop, info, add_to_output=False) -> None:
    self.data[prop] = info
    if add_to_output:
        self.output_properties.add(prop)

def get(self, prop, default=None) -> Any:
    if prop == TEXT_ATTRIBUTE:
        return self.text
    return self.data.get(prop, default)
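Here’s a minimal, self-contained sketch of that asymmetry (this is a simplified stand-in for the real Message class, not Rasa’s actual implementation):

```python
from typing import Any

TEXT_ATTRIBUTE = "text"  # the constant's value in Rasa 1.6.x


class Message:
    """Simplified stand-in for rasa.nlu.training_data.Message."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data: dict = {}

    def set(self, prop: str, info: Any) -> None:
        # set() only ever writes to the data dict, never to self.text
        self.data[prop] = info

    def get(self, prop: str, default: Any = None) -> Any:
        # get() special-cases "text" and returns self.text instead of data["text"]
        if prop == TEXT_ATTRIBUTE:
            return self.text
        return self.data.get(prop, default)


msg = Message("flights timings to India?")
msg.set(TEXT_ATTRIBUTE, "flight timing to india")
print(msg.get(TEXT_ATTRIBUTE))   # "flights timings to India?" - the original, untouched
print(msg.data[TEXT_ATTRIBUTE])  # "flight timing to india" - the lemmatized copy
```

So a component that only calls set(TEXT_ATTRIBUTE, ...) changes what downstream featurizers see in data, but not the raw message.text that the processor logs.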
You can get around this by doing:
def process(self, message: Message, **kwargs: Any) -> None:
    message.text = self.tokenize(message.text)
However, since you’re creating a tokenizer, it would make sense for your class to return tokens rather than text, which would be handled correctly. Running WhitespaceTokenizer over pre-tokenized text wouldn’t really make sense, in my opinion.
For example, I passed the sentence “Are you a bot?” to the model, and Lemmatizationtokenizer returns
be -PRON- a bot?
But after WhitespaceTokenizer
is done with it, you get
be pron a bot?
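If you did want your component to behave as a true tokenizer, the usual pattern is to produce token objects with character offsets rather than a joined string. A rough sketch of that idea (using a simplified Token class, not the real rasa.nlu.tokenizers.Token):

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Simplified stand-in for Rasa's Token: the token text and its start offset."""
    text: str
    offset: int


def tokenize_with_offsets(text: str) -> List[Token]:
    # Split on whitespace while tracking each token's start position in the text
    tokens = []
    offset = 0
    for word in text.split():
        offset = text.index(word, offset)
        tokens.append(Token(word, offset))
        offset += len(word)
    return tokens


print(tokenize_with_offsets("be -PRON- a bot?"))
# [Token('be', 0), Token('-PRON-', 3), Token('a', 10), Token('bot?', 12)]
```

Because tokens carry offsets into the original text, entity extractors downstream can map predictions back to the raw message, which is exactly what gets lost when a “tokenizer” rewrites the text instead.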
I changed the code to this:
But I am still facing the same problem. And I am handling the -PRON- using stop words in my code.
I don’t want to create a Tokenizer. I just want to take the input, do lemmatization, and forward that input to the next component in the NLU pipeline.
Ok, if you don’t want to tokenize then it makes sense to edit the text; I was going off of the class name.
Did you remove everything from your tokenize function? With the code below (basically the same as your original, except for the train and process functions), I got the results that follow:
config.yml
language: en
pipeline:
- name: Lemmatizer.Lemmatizer
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: EntitySynonymMapper
- name: CountVectorsFeaturizer
- name: EmbeddingIntentClassifier
Lemmatizer.py
import os
from typing import Any, Optional, Text, Dict, List, Type

import spacy

from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.constants import TEXT_ATTRIBUTE
from rasa.nlu.components import Component

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

nlp = spacy.load("en")


class Lemmatizer(Component):

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        """Specify which components need to be present in the pipeline."""
        return []

    # language_list = ["en", "fr"]
    language_list = None
    defaults = {}
    provides = ["text"]
    name = "Lemmatizationtokenizer"

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        for example in training_data.training_examples:
            example.set(
                TEXT_ATTRIBUTE,
                self.tokenize(example.get(TEXT_ATTRIBUTE))
            )

    def process(self, message: Message, **kwargs: Any) -> None:
        # Lemmatize once, then write the result to both message.text and the data dict
        message.text = self.tokenize(message.text)
        message.set(TEXT_ATTRIBUTE, message.text)

    def tokenize(self, text: Text):
        # English lemmatization using spacy
        if text:
            text = text.replace("-", " ")
            text = text.lower()
            doc = nlp(str(text))
            lst = []
            for token in doc:
                lst.append(token.lemma_)
            text = ' '.join(lst)
            text = text.replace(" ?", "?").replace(" .", ".")
            text = text.strip()
        return text

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        else:
            return cls(meta)
rasa shell nlu
NLU model loaded. Type a message and press enter to parse it.
Next message:
Flights to india
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "flight to india"
}
Next message:
are you a bot?
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "be -PRON- a bot?"
}
Next message:
I-have-dashes-everywhere
{
  "intent": {
    "name": "greet",
    "confidence": 0.533901035785675
  },
  "entities": [],
  "intent_ranking": [
    {
      "name": "greet",
      "confidence": 0.533901035785675
    },
    {
      "name": "goodbye",
      "confidence": 0.46609896421432495
    }
  ],
  "text": "i have dash everywhere"
}
Ok, I think those two issues are unrelated to your lemmatizer. The first warning says you should check your annotation of custom entities so that you don’t have e.g. This is an [entity ](ent) with trailing whitespace or this [entity, ](ent) includes punctuation.
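For comparison, a clean annotation keeps the brackets tight around the entity text, with no trailing whitespace or punctuation inside them:
- This is an [entity](ent) with clean brackets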
The second warning says you should check that you haven’t defined two values for a single entity, e.g.
- I work for [Rasa](company:rasa)
- I work for [Rasa](company:RasaTechnologies)
Those are considered conflicting synonym definitions.
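A consistent version maps every surface form of the entity to the same synonym value, for example:
- I work for [Rasa](company:rasa)
- I work for [Rasa Technologies](company:rasa)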