Lemmatization & Punctuations

Hello dear community members,

How can we apply lemmatization (getting the dictionary form of the tokens) and remove punctuation? I have another question: is there any way to get the processed words after nlu_configuration? For example, we tokenize the words, lemmatize them, and so on. Can we get the tokenized, lemmatized version of the words? Thank you very much in advance.


The whitespace tokenizer removes punctuation. If spaCy is included, we use the lemma from spaCy, so I guess it is already there. However, lemmatization is a non-trivial process that doesn’t always work well.


To get the processed words, you can create a custom component that extracts the needed features from the Message object.
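A minimal sketch of that idea follows. To keep it runnable without Rasa installed, `StubMessage` and `StubToken` are hypothetical stand-ins for Rasa's `Message` and `Token`, and `ProcessedTextExporter` and the `"processed_tokens"` key are made-up names. In a real project the class would presumably subclass `rasa.nlu.components.Component` and be listed in the pipeline after the tokenizer.

```python
from typing import Any, Dict, Optional


class StubToken:
    """Hypothetical stand-in for a Rasa token: text plus start offset."""

    def __init__(self, text: str, offset: int) -> None:
        self.text = text
        self.offset = offset


class StubMessage:
    """Hypothetical stand-in for Rasa's Message, exposing only get/set."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data: Dict[str, Any] = {}

    def get(self, prop: str, default: Optional[Any] = None) -> Any:
        return self.data.get(prop, default)

    def set(self, prop: str, value: Any) -> None:
        self.data[prop] = value


class ProcessedTextExporter:
    """Sketch of a component that runs after the tokenizer."""

    name = "processed_text_exporter"  # hypothetical component name

    def process(self, message: StubMessage, **kwargs: Any) -> None:
        # Read whatever the tokenizer stored and expose the plain strings.
        tokens = message.get("tokens", [])
        message.set("processed_tokens", [t.text for t in tokens])


# Simulate a message that an upstream tokenizer has already handled.
message = StubMessage("who owns the projects")
message.set("tokens", [StubToken("own", 0), StubToken("project", 4)])
ProcessedTextExporter().process(message)
print(message.get("processed_tokens"))
```

The component only reads what earlier pipeline steps stored on the message, so it does not change how intents or entities are predicted.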

Thank you very much, Vladimir. I appreciate it.

If you use the Rasa tokenizer tokenizer_spacy, then by default it returns the verbatim token text, not the lemma.

To lemmatize, you should create a custom tokenizer component based on the tokenizer_spacy implementation.

I wrote a post about it.


Thank you very much Tatiana.

Hi, I modified the spacy_tokenizer.py file to lemmatize the user inputs and to remove stop words. The file:

import re
import typing
from typing import Any, List

import spacy

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData

if typing.TYPE_CHECKING:
    from spacy.tokens.doc import Doc  # pytype: disable=import-error

nlp = spacy.load('en')

# Custom stop word list, matched case-sensitively against whitespace-split words.
stop_words = [
    'ours', 'keep', 'in', 'enough', 'anything', 'latterly', 'thereupon', 'your', 'if', 'as', 'each', 'his', 'but',
    'everywhere', 'hereupon', 'being', 'becoming', 'and', 'anyhow', 'serious', 'something', 'latter', 'namely',
    'name', 'seemed', 'yourselves', 'toward', 'must', 'same', 'then', 'become', 'while', 'becomes', 'ourselves',
    'perhaps', 'or', 'more', 'whose', 'along', 'own', 'thence', 'had', 'itself', 'top', 'whether', 'beside', 'into',
    'on', 'per', 'whole', 'one', 'towards', 'himself', 'against', 'beyond', 'off', 'done', 'are', 'you', 'he',
    'yours', 'an', 'myself', 'themselves', 'hereafter', 'else', 'have', 'neither', 'again', 'afterwards', 'under',
    'its', 'due', 'always', 'be', 'over', 'therefore', 'very', 'at', 'during', 'nobody', 'where', 'whoever',
    'across', 'thereafter', 'i', 'thereby', 'empty', 'move', 'put', 'through', 'since', 'my', 'wherein', 'became',
    'thus', 'none', 'cannot', 'did', 'next', 'above', 'regarding', 'to', 'too', 'within', 'just', 'nothing', 'now',
    'am', 'part', 'seems', 'than', 'alone', 'after', 'once', 'doing', 'otherwise', 'who', 'indeed', 'full',
    'whence', 'before', 'how', 'although', 'mostly', 'take', 'between', 'these', 'whereas', 'former', 'whom',
    'many', 'amongst', 'other', 'ca', 'besides', 'go', 'much', 'may', 'nowhere', 'together', 'him', 'her', 'there',
    'say', 'throughout', 'whereby', 'mine', 'formerly', 'only', 'really', 'herein', 'show', 'might', 'hers',
    'often', 'when', 'whereupon', 'those', 'rather', 'somewhere', 'give', 'here', 'do', 'used', 'does', 'me',
    'seem', 'unless', 'sometime', 'almost', 'via', 'back', 'hereby', 'few', 'all', 'up', 'using', 'should', 'well',
    'see', 'been', 'various', 'yourself', 'bottom', 'onto', 'side', 'for', 'everyone', 'will', 'several',
    'however', 'meanwhile', 'can', 'everything', 'around', 'she', 'of', 'their', 'were', 'get', 'until', 'that',
    'yet', 'already', 'both', 'by', 'somehow', 'any', 'please', 'whereafter', 'behind', 'therein', 'the', 'they',
    'whenever', 'out', 'still', 'our', 'most', 'least', 'though', 'with', 'a', 'could', 'such', 'less', 'was',
    'nor', 'others', 'why', 'about', 'never', 'so', 'us', 'wherever', 'beforehand', 'moreover', 'last', 'among',
    'elsewhere', 'nevertheless', 'quite', 'upon', 'ever', 'anywhere', 'we', 'down', 'what', 'amount', 'whither',
    'it', 'below', 'someone', 'either', 'is', 'some', 'even', 'also', 'from', 'except', 'further', 'herself',
    'make', 'which', 'this', 'call', 'without', 'made', 're', 'sometimes', 'another', 'whatever', 'anyone',
    'would', 'every', 'thru', 'them', 'anyway', 'hence', 'has', 'because', 'seeming', "what's", "whats", '-PRON-',
    'iam', 'im', "i'm",
]


class SpacyTokenizer(Tokenizer, Component):

    name = "tokenizer_spacy_lemma"
    provides = ["tokens"]
    requires = ["spacy_doc"]

    def train(self,
              training_data: TrainingData,
              config: RasaNLUModelConfig,
              **kwargs: Any) -> None:

        for example in training_data.training_examples:
            # tokenize() works on plain text (it calls re.sub), so pass the
            # raw text here, as process() does, instead of the spaCy Doc.
            example.set("tokens", self.tokenize(example.text))

    def process(self, message: Message, **kwargs: Any) -> None:
        message.set("tokens", self.tokenize(message.text))

    def tokenize(self, text: str) -> List[Token]:
        # Drop sentence punctuation that is followed by whitespace or the end of the text.
        words = re.sub(r'[.,!?]+(\s|$)', ' ', text).split()
        # Remove stop words before lemmatizing.
        kept = [tok for tok in words if tok not in stop_words]
        # Lemmatize the remaining words with spaCy.
        lemmas = [str(tok.lemma_) for tok in nlp(' '.join(kept))]
        # Strip leftover punctuation and non-ASCII characters. Inside the
        # character class, `.-/` is a range that covers only '.' and '/'.
        cleaned = [
            re.sub(r'[^\x00-\x7f]', '',
                   re.sub(r'[\t\r\n,)([\]!%|!#$%&*+,.-/:;<=>?@^_`{|}~?]', '', lemma)).strip()
            for lemma in lemmas
        ]
        # Skip lemmas that became empty after cleaning.
        words = [word for word in cleaned if word]

        # Offsets are positions in the joined lemmatized string,
        # not positions in the original message text.
        tokens = []
        texts = ' '.join(words)
        running_offset = 0
        for word in words:
            word_offset = texts.index(word, running_offset)
            running_offset = word_offset + len(word)
            tokens.append(Token(word, word_offset))
        return tokens
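The offset bookkeeping at the end of tokenize can be looked at in isolation. The sketch below uses only the standard library. `token_offsets` is a hypothetical helper name, and the word list is an assumed result of stop-word removal and lemmatization, not actual output from the component. It shows that the offsets point into the joined, processed string rather than into the original message.

```python
from typing import List, Tuple


def token_offsets(words: List[str]) -> List[Tuple[str, int]]:
    """Locate each word inside the space-joined string, scanning left to right."""
    joined = " ".join(words)
    offsets = []
    running = 0
    for word in words:
        start = joined.index(word, running)
        offsets.append((word, start))
        running = start + len(word)
    return offsets


original = "who is the owner for pv first"
processed = ["owner", "pv", "first"]  # assumed output after preprocessing

print(token_offsets(processed))
print(original.index("owner"))
```

In the processed string "owner" starts at position 0, while in the original message it starts at position 11, so any component that uses token offsets to slice the original text would read the wrong characters.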

My nlu.md:

  • who is the owner for pv first
  • leader of pv first
  • who owns pv first
  • who controls pv first
  • who oreders pv first
  • who is the owner for alsc
  • leader of alsc
  • who owns alsc
  • who controls alsc
  • who oreders alsc
  • who is the owner for ucr
  • leader of ucr
  • who owns ucr
  • who controls ucr
  • who oreders ucr
  • who is the owner for arw
  • leader of arw
  • who owns arw
  • who controls arw
  • who oreders arw
  • who is the owner for coip
  • leader of coip
  • who owns coip
  • who owns coip
  • who owns coip
  • who owns coip
  • who controls coip
  • who oreders coip
  • who is the owner for cdisc
  • leader of cdisc
  • who owns cdisc
  • who controls cdisc
  • who oreders cdisc

When I debug it in rasa shell nlu, I get the following results:

case 1 -> user input -> "owner"

debugged log -> ['owner']

shell output ->

{
  "intent": {
    "name": "owner",
    "confidence": 0.9947196496583003
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9947196496583003 },
    { "name": "out_of_scope", "confidence": 0.001791049208563651 },
    { "name": "thank_you", "confidence": 0.001411993969675 },
    { "name": "greet", "confidence": 0.0007976021964830285 },
    { "name": "inform", "confidence": 0.00047270412751516944 },
    { "name": "person_enquiry", "confidence": 0.0004340739228840122 },
    { "name": "client_info", "confidence": 0.00019387738146468518 },
    { "name": "project_usecase", "confidence": 0.00017904953511428266 }
  ],
  "text": "owner"
}

case 2 -> user input -> "owners"

debugged log -> ['owner']

shell output ->

{
  "intent": {
    "name": "owner",
    "confidence": 0.9253092600222103
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9253092600222103 },
    { "name": "client_info", "confidence": 0.057077033004912105 },
    { "name": "inform", "confidence": 0.007490610202169739 },
    { "name": "greet", "confidence": 0.003935998500918612 },
    { "name": "out_of_scope", "confidence": 0.0026231646658204386 },
    { "name": "thank_you", "confidence": 0.0016098162847478703 },
    { "name": "person_enquiry", "confidence": 0.0010423624130542462 },
    { "name": "project_usecase", "confidence": 0.000911754906166398 }
  ],
  "text": "owners"
}

As you can see, the confidence score for "owner" is 0.9947196496583003, while the confidence score for "owners" is 0.9253092600222103, even though both inputs produce the same token in the debug log.

Why is there a difference in the confidence scores? Am I proceeding correctly, or is there anything that needs to be changed in the code? Can someone comment on this?

And my pipeline is:

language: "en"

pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "SklearnIntentClassifier"

@TatianaParshina @Ghostvv, could you please comment on this?

SpacyFeaturizer doesn’t use tokens; it takes doc.vector as a feature. The spaCy vectors for "owners" and "owner" are probably different, which would explain the difference in confidence.
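A toy illustration of why the custom tokens make no difference here: the vectors below are made up, and `doc_vector` and `featurize` are hypothetical stand-ins rather than spaCy or Rasa functions. The only real behaviour assumed is that spaCy's doc.vector defaults to the average of the token vectors of the text it was built from.

```python
from typing import Dict, List

# Made-up 3-dimensional vectors, purely for illustration; real spaCy vectors
# are much larger and come from the loaded model.
TOY_VECTORS: Dict[str, List[float]] = {
    "owner": [0.9, 0.1, 0.0],
    "owners": [0.7, 0.3, 0.1],
}


def doc_vector(text: str) -> List[float]:
    """Average the word vectors of the raw text, like a sentence vector."""
    vectors = [TOY_VECTORS[word] for word in text.split()]
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def featurize(raw_text: str, tokens: List[str]) -> List[float]:
    """Hypothetical featurizer that ignores `tokens` and uses only the raw text."""
    return doc_vector(raw_text)


# Both inputs end up with the same lemmatized token, but the raw text differs.
features_owner = featurize("owner", tokens=["owner"])
features_owners = featurize("owners", tokens=["owner"])
print(features_owner != features_owners)
```

Because the features come from the raw text, the classifier sees two different inputs even though the tokenizer output is identical.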

@Ghostvv Thank you for the response. Is there any way I can pre-process the input (lemmatization, stop word removal, and so on) with the spaCy featurizer in the pipeline?

spaCy creates a vector for the whole sentence; you need to check the spaCy documentation to see whether it uses lemmas. For stop word removal, you need to add a custom component, or maybe there is an option in spaCy for it.
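One possible shape for such a component's core logic is to normalize the text before any vector is computed, so that "owner" and "owners" reach the featurizer as the same string. The sketch below is standard-library only. `normalize`, `STOP_WORDS`, and `LEMMAS` are hypothetical names, and the tiny lookup tables are illustrative. With spaCy available, the token attributes `is_stop` and `lemma_` could take their place. Where exactly this would hook into the pipeline is an assumption, since it has to run before the spaCy doc is built.

```python
import re
from typing import Dict, FrozenSet

# Tiny illustrative tables; a real component would use a full stop word list
# and a proper lemmatizer.
STOP_WORDS: FrozenSet[str] = frozenset({"who", "is", "the", "for", "of"})
LEMMAS: Dict[str, str] = {"owners": "owner", "owns": "own", "controls": "control"}


def normalize(text: str,
              stop_words: FrozenSet[str] = STOP_WORDS,
              lemmas: Dict[str, str] = LEMMAS) -> str:
    """Lowercase, drop punctuation and stop words, and map words to lemmas."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    kept = [lemmas.get(word, word) for word in words if word not in stop_words]
    return " ".join(kept)


print(normalize("Who is the owner for ALSC?"))
print(normalize("owners"))
```

If the normalized string is what gets vectorized, both "owner" and "owners" produce the same features, at the cost of discarding information that the classifier might otherwise use.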