Lemmatization & Punctuations

serdar · March 28, 2019, 5:42pm

Hello dear community members,

How can we lemmatize tokens (get their dictionary form) and remove punctuation? I have another question: is there any way to get the processed words after the nlu_configuration steps have run? For example, after we tokenize the words, lemmatize them, and so on, can we get the tokenized, lemmatized version of the words? Thank you very much in advance.

Serdar

Ghostvv · April 4, 2019, 3:35pm

The whitespace tokenizer removes punctuation. If spaCy is included, we use the lemma from spaCy, so I guess it is already there. However, lemmatization is a non-trivial process that doesn't always work well.
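As a rough illustration of the first point, here is a minimal sketch of whitespace tokenization that drops punctuation. It is not Rasa's actual tokenizer code; the function name `whitespace_tokenize` and the regex are assumptions made for this example.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace after replacing punctuation with spaces (illustrative only)."""
    # Anything that is neither a word character nor whitespace becomes a space,
    # so "Hello," and "project?" lose their trailing punctuation.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return cleaned.split()


tokens = whitespace_tokenize("Hello, who owns this project?")
print(tokens)
```

Note that a rule this simple also splits contractions such as "don't" into two pieces, which may or may not be what you want.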

Ghostvv · April 4, 2019, 3:36pm

To get the processed words, you can create a custom component that extracts the needed features from the Message object.
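A minimal sketch of that idea, runnable without Rasa installed. `FakeMessage` is a hypothetical stand-in that only mimics the `get`/`set` access pattern of Rasa's Message, and `TokenLogger` is an invented name; a real component would presumably subclass Rasa's `Component` and be listed in the pipeline after the tokenizer.

```python
class FakeMessage:
    """Hypothetical stand-in for Rasa's Message: a text plus a dict of attributes."""

    def __init__(self, text):
        self.text = text
        self.data = {}

    def get(self, prop, default=None):
        return self.data.get(prop, default)

    def set(self, prop, value):
        self.data[prop] = value


class TokenLogger:
    """Hypothetical component that records what earlier components produced."""

    def __init__(self):
        self.seen = []

    def process(self, message, **kwargs):
        # Assumption: the tokenizer stored its output under the "tokens" key.
        tokens = message.get("tokens", [])
        self.seen.append((message.text, list(tokens)))


message = FakeMessage("who owns pv first")
message.set("tokens", ["who", "own", "pv", "first"])  # pretend a tokenizer ran earlier
logger = TokenLogger()
logger.process(message)
print(logger.seen)
```

The same pattern should work for any other attribute that an earlier component has set on the message.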

serdar · April 5, 2019, 2:42pm

Thank you very much, Vladimir. I appreciate it.

TatianaParshina · April 14, 2019, 6:47pm

If you use the Rasa tokenizer tokenizer_spacy, then by default it returns the verbatim text content, not the lemma.

You should create a custom tokenizer component based on the tokenizer_spacy implementation to do lemmatization.

I wrote a post about it.
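The core of such a tokenizer could look roughly like the sketch below. To keep it runnable without a spaCy model, it uses a hypothetical `FakeToken` stand-in whose fields mirror the `text`, `lemma_`, and `idx` attributes of a spaCy token; `lemma_tokens` is an invented helper name, not part of Rasa or spaCy.

```python
from collections import namedtuple

# Stand-in for a spaCy token so the sketch runs without a model;
# text, lemma_ and idx mirror the attribute names on spaCy's Token.
FakeToken = namedtuple("FakeToken", ["text", "lemma_", "idx"])


def lemma_tokens(doc):
    """Return (lemma, start offset) pairs, one per token in doc."""
    # Use the lemma as the token text but keep the offset into the original
    # message, so positions still refer to what the user actually typed.
    return [(tok.lemma_, tok.idx) for tok in doc]


doc = [
    FakeToken("who", "who", 0),
    FakeToken("owns", "own", 4),
    FakeToken("alsc", "alsc", 9),
]
pairs = lemma_tokens(doc)
print(pairs)
```

In a real component each pair would presumably be wrapped in Rasa's `Token(text, offset)` instead of a plain tuple.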

serdar · April 14, 2019, 6:59pm

Thank you very much Tatiana.

Vighnesh · September 19, 2019, 4:12am

Hi, I had modified the spacy_tokenizer.py file to lemmatize the user inputs and to remove stop words. File:

import typing
from typing import Any

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData

from rasa.nlu.constants import (
    MESSAGE_RESPONSE_ATTRIBUTE,
    MESSAGE_INTENT_ATTRIBUTE,
    MESSAGE_TEXT_ATTRIBUTE,
    MESSAGE_TOKENS_NAMES,
    MESSAGE_ATTRIBUTES,
    MESSAGE_SPACY_FEATURES_NAMES,
    MESSAGE_VECTOR_FEATURE_NAMES,
)

if typing.TYPE_CHECKING:
    from spacy.tokens.doc import Doc  # pytype: disable=import-error

import spacy
import re

nlp = spacy.load('en')
stop_words = [
    'ours', 'keep', 'in', 'enough', 'anything', 'latterly', 'thereupon', 'your', 'if', 'as',
    'each', 'his', 'but', 'everywhere', 'hereupon', 'being', 'becoming', 'and', 'anyhow',
    'serious', 'something', 'latter', 'namely', 'name', 'seemed', 'yourselves', 'toward',
    'must', 'same', 'then', 'become', 'while', 'becomes', 'ourselves', 'perhaps', 'or', 'more',
    'whose', 'along', 'own', 'thence', 'had', 'itself', 'top', 'whether', 'beside', 'into',
    'on', 'per', 'whole', 'one', 'towards', 'himself', 'against', 'beyond', 'off', 'done',
    'are', 'you', 'he', 'yours', 'an', 'myself', 'themselves', 'hereafter', 'else', 'have',
    'neither', 'again', 'afterwards', 'under', 'its', 'due', 'always', 'be', 'over',
    'therefore', 'very', 'at', 'during', 'nobody', 'where', 'whoever', 'across', 'thereafter',
    'i', 'thereby', 'empty', 'move', 'put', 'through', 'since', 'my', 'wherein', 'became',
    'thus', 'none', 'cannot', 'did', 'next', 'above', 'regarding', 'to', 'too', 'within',
    'just', 'nothing', 'now', 'am', 'part', 'seems', 'than', 'alone', 'after', 'once', 'doing',
    'otherwise', 'who', 'indeed', 'full', 'whence', 'before', 'how', 'although', 'mostly',
    'take', 'between', 'these', 'whereas', 'former', 'whom', 'many', 'amongst', 'other', 'ca',
    'besides', 'go', 'much', 'may', 'nowhere', 'together', 'him', 'her', 'there', 'say',
    'throughout', 'whereby', 'mine', 'formerly', 'only', 'really', 'herein', 'show', 'might',
    'hers', 'often', 'when', 'whereupon', 'those', 'rather', 'somewhere', 'give', 'here', 'do',
    'used', 'does', 'me', 'seem', 'unless', 'sometime', 'almost', 'via', 'back', 'hereby',
    'few', 'all', 'up', 'using', 'should', 'well', 'see', 'been', 'various', 'yourself',
    'bottom', 'onto', 'side', 'for', 'everyone', 'will', 'several', 'however', 'meanwhile',
    'can', 'everything', 'around', 'she', 'of', 'their', 'were', 'get', 'until', 'that', 'yet',
    'already', 'both', 'by', 'somehow', 'any', 'please', 'whereafter', 'behind', 'therein',
    'the', 'they', 'whenever', 'out', 'still', 'our', 'most', 'least', 'though', 'with', 'a',
    'could', 'such', 'less', 'was', 'nor', 'others', 'why', 'about', 'never', 'so', 'us',
    'wherever', 'beforehand', 'moreover', 'last', 'among', 'elsewhere', 'nevertheless',
    'quite', 'upon', 'ever', 'anywhere', 'we', 'down', 'what', 'amount', 'whither', 'it',
    'below', 'someone', 'either', 'is', 'some', 'even', 'also', 'from', 'except', 'further',
    'herself', 'make', 'which', 'this', 'call', 'without', 'made', 're', 'sometimes',
    'another', 'whatever', 'anyone', 'would', 'every', 'thru', 'them', 'anyway', 'hence',
    'has', 'because', 'seeming', "what's", "whats", '-PRON-', 'iam', 'im', "i'm", "what's",
    "whats", 'am',
]

class SpacyTokenizer(Tokenizer, Component):

    name = "tokenizer_spacy_lemma"
    provides = ["tokens"]
    requires = ["spacy_doc"]

    def train(self,
              training_data: TrainingData,
              config: RasaNLUModelConfig,
              **kwargs: Any) -> None:

        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.get("spacy_doc")))

    def process(self, message: Message, **kwargs: Any) -> None:
        #message = nlp(message)
        # Debug output for the incoming message
        print("********************")
        print(message)
        print(message.get("spacy_doc"))
        print(message.text)
        message.set("tokens", self.tokenize(message.text))

    def tokenize(self, doc):
        doc = str(doc)
        # Drop sentence punctuation, then split on whitespace
        words = re.sub(r'[.,!?]+(\s|$)', ' ', doc).split()
        print(type(doc))
        # Remove stop words before lemmatizing
        toq = [tok for tok in words if tok not in stop_words]
        doc1 = nlp(str(' '.join(toq)))
        words = [str(lemm.lemma_) for lemm in doc1]
        # Strip remaining punctuation and non-ASCII characters from each lemma
        words = [re.sub(r'[^\x00-\x7f]','',re.sub('[\t\r\n,)([\]!%|!#$%&*+,.-/:;<=>?@^_`{|}~?]','',str(i))).strip() for i in words]
        tokens = []
        # Note: offsets are computed against the joined lemmas, not the original message text
        texts = ' '.join(words)
        running_offset = 0
        print(words)
        for word in words:
            word_offset = texts.index(word, running_offset)
            word_len = len(word)
            running_offset = word_offset + word_len
            tokens.append(Token(word, word_offset))
        print(tokens)
        return tokens

My nlu.md:

who is the owner for pv first
leader of pv first
who owns pv first
who controls pv first
who oreders pv first
who is the owner for alsc
leader of alsc
who owns alsc
who controls alsc
who oreders alsc
who is the owner for ucr
leader of ucr
who owns ucr
who controls ucr
who oreders ucr
who is the owner for arw
leader of arw
who owns arw
who controls arw
who oreders arw
who is the owner for coip
leader of coip
who owns coip
who owns coip
who owns coip
who owns coip
who controls coip
who oreders coip
who is the owner for cdisc
leader of cdisc
who owns cdisc
who controls cdisc
who oreders cdisc

And when I debug it in rasa shell nlu, I get the following results:

case 1 -> user input -> "owner"

debugged log -> ['owner']

shell output ->

{
  "intent": {
    "name": "owner",
    "confidence": 0.9947196496583003
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9947196496583003 },
    { "name": "out_of_scope", "confidence": 0.001791049208563651 },
    { "name": "thank_you", "confidence": 0.001411993969675 },
    { "name": "greet", "confidence": 0.0007976021964830285 },
    { "name": "inform", "confidence": 0.00047270412751516944 },
    { "name": "person_enquiry", "confidence": 0.0004340739228840122 },
    { "name": "client_info", "confidence": 0.00019387738146468518 },
    { "name": "project_usecase", "confidence": 0.00017904953511428266 }
  ],
  "text": "owner"
}

case 2 -> user input -> "owners"

debugged log -> ['owner']

shell output ->

{
  "intent": {
    "name": "owner",
    "confidence": 0.9253092600222103
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9253092600222103 },
    { "name": "client_info", "confidence": 0.057077033004912105 },
    { "name": "inform", "confidence": 0.007490610202169739 },
    { "name": "greet", "confidence": 0.003935998500918612 },
    { "name": "out_of_scope", "confidence": 0.0026231646658204386 },
    { "name": "thank_you", "confidence": 0.0016098162847478703 },
    { "name": "person_enquiry", "confidence": 0.0010423624130542462 },
    { "name": "project_usecase", "confidence": 0.000911754906166398 }
  ],
  "text": "owners"
}

As you can see, the confidence score for "owner" is 0.9947196496583003, while the confidence score for "owners" is 0.9253092600222103.

Why is there a difference in the confidence scores, even though both inputs are lemmatized to "owner"? Am I proceeding correctly, or is there anything that needs to be changed in the code? Can someone comment on this?

And my pipeline is:

language: "en"

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"

@TatianaParshina @Ghostvv Could you please comment on this?

Ghostvv · September 19, 2019, 10:09am

SpacyFeaturizer doesn't use tokens; it takes doc.vector as a feature. The spaCy vectors for "owners" and "owner" are probably different.
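To make this concrete: by default spaCy's doc.vector is an average of the token vectors, so two inputs with different word vectors give the classifier different features regardless of what the tokenizer emits. The sketch below uses made-up 3-dimensional vectors (`TOY_VECTORS` is invented for illustration and is not real spaCy data).

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
TOY_VECTORS = {
    "owner": [0.9, 0.1, 0.3],
    "owners": [0.7, 0.4, 0.2],
}


def doc_vector(words):
    """Average the word vectors, mirroring the default behaviour of doc.vector."""
    dims = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(TOY_VECTORS[word]):
            total[i] += value
    return [value / len(words) for value in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


similarity = cosine(doc_vector(["owner"]), doc_vector(["owners"]))
print(similarity)  # below 1.0, so the two inputs are not identical features
```

With real spaCy vectors the numbers will differ, but the same reasoning would explain why "owner" and "owners" end up with different confidence scores.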

Vighnesh · September 24, 2019, 7:22am

@Ghostvv Thank you for the response. Is there any way to pre-process the input (lemmatization, stop word removal, and so on) with the spaCy featurizer in the pipeline?

Ghostvv · September 25, 2019, 11:11am

spaCy creates a vector for a sentence; you need to check the spaCy documentation to see whether it uses lemmas. For stop word removal, you need to add a custom component, or maybe there is an option in spaCy for it.
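One possible shape for such a custom component is to normalize the text first and only then compute the vector. The sketch below shows just the normalization step, under the assumption that tokens expose `lemma_`, `is_stop`, and `is_punct` as spaCy tokens do. `normalize` and `fake` are hypothetical names, and the stand-in tokens exist only so the example runs without a spaCy model.

```python
from types import SimpleNamespace


def normalize(doc):
    """Join the lemmas of tokens that are neither stop words nor punctuation."""
    kept = [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
    return " ".join(kept)


def fake(text, lemma, is_stop=False, is_punct=False):
    # Stand-in for a spaCy token; lemma_, is_stop and is_punct mirror
    # the attribute names on spaCy's Token.
    return SimpleNamespace(text=text, lemma_=lemma, is_stop=is_stop, is_punct=is_punct)


doc = [
    fake("who", "who", is_stop=True),
    fake("owns", "own"),
    fake("the", "the", is_stop=True),
    fake("projects", "project"),
    fake("?", "?", is_punct=True),
]
normalized = normalize(doc)
print(normalized)
```

In a pipeline, a component placed between SpacyNLP and SpacyFeaturizer could presumably re-run the spaCy model on the normalized string and store the result under "spacy_doc", so that the featurizer sees the vector of the normalized text. Whether that improves classification would need to be tested.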
