Able to lemmatize by modifying spacy_tokenizer, but the output confidence differs for the same stem word

Hi, I modified the spacy_tokenizer.py file to lemmatize the user inputs and to remove stop words. Here is the file:

```python
import typing
from typing import Any

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData

from rasa.nlu.constants import (
    MESSAGE_RESPONSE_ATTRIBUTE,
    MESSAGE_INTENT_ATTRIBUTE,
    MESSAGE_TEXT_ATTRIBUTE,
    MESSAGE_TOKENS_NAMES,
    MESSAGE_ATTRIBUTES,
    MESSAGE_SPACY_FEATURES_NAMES,
    MESSAGE_VECTOR_FEATURE_NAMES,
)

if typing.TYPE_CHECKING:
    from spacy.tokens.doc import Doc  # pytype: disable=import-error

import re

import spacy

nlp = spacy.load("en")

stop_words = [
    "ours", "keep", "in", "enough", "anything", "latterly", "thereupon", "your",
    "if", "as", "each", "his", "but", "everywhere", "hereupon", "being", "becoming",
    "and", "anyhow", "serious", "something", "latter", "namely", "name", "seemed",
    "yourselves", "toward", "must", "same", "then", "become", "while", "becomes",
    "ourselves", "perhaps", "or", "more", "whose", "along", "own", "thence", "had",
    "itself", "top", "whether", "beside", "into", "on", "per", "whole", "one",
    "towards", "himself", "against", "beyond", "off", "done", "are", "you", "he",
    "yours", "an", "myself", "themselves", "hereafter", "else", "have", "neither",
    "again", "afterwards", "under", "its", "due", "always", "be", "over",
    "therefore", "very", "at", "during", "nobody", "where", "whoever", "across",
    "thereafter", "i", "thereby", "empty", "move", "put", "through", "since", "my",
    "wherein", "became", "thus", "none", "cannot", "did", "next", "above",
    "regarding", "to", "too", "within", "just", "nothing", "now", "am", "part",
    "seems", "than", "alone", "after", "once", "doing", "otherwise", "who",
    "indeed", "full", "whence", "before", "how", "although", "mostly", "take",
    "between", "these", "whereas", "former", "whom", "many", "amongst", "other",
    "ca", "besides", "go", "much", "may", "nowhere", "together", "him", "her",
    "there", "say", "throughout", "whereby", "mine", "formerly", "only", "really",
    "herein", "show", "might", "hers", "often", "when", "whereupon", "those",
    "rather", "somewhere", "give", "here", "do", "used", "does", "me", "seem",
    "unless", "sometime", "almost", "via", "back", "hereby", "few", "all", "up",
    "using", "should", "well", "see", "been", "various", "yourself", "bottom",
    "onto", "side", "for", "everyone", "will", "several", "however", "meanwhile",
    "can", "everything", "around", "she", "of", "their", "were", "get", "until",
    "that", "yet", "already", "both", "by", "somehow", "any", "please",
    "whereafter", "behind", "therein", "the", "they", "whenever", "out", "still",
    "our", "most", "least", "though", "with", "a", "could", "such", "less", "was",
    "nor", "others", "why", "about", "never", "so", "us", "wherever", "beforehand",
    "moreover", "last", "among", "elsewhere", "nevertheless", "quite", "upon",
    "ever", "anywhere", "we", "down", "what", "amount", "whither", "it", "below",
    "someone", "either", "is", "some", "even", "also", "from", "except", "further",
    "herself", "make", "which", "this", "call", "without", "made", "re",
    "sometimes", "another", "whatever", "anyone", "would", "every", "thru", "them",
    "anyway", "hence", "has", "because", "seeming", "what's", "whats", "-PRON-",
    "iam", "im", "i'm", "what's", "whats", "am",
]


class SpacyTokenizer(Tokenizer, Component):

name = "tokenizer_spacy_lemma"
provides = ["tokens"]
requires = ["spacy_doc"]

    def train(self,
              training_data: TrainingData,
              config: RasaNLUModelConfig,
              **kwargs: Any) -> None:

        # at training time the spacy_doc (created by SpacyNLP) is handed to tokenize
        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.get("spacy_doc")))

    def process(self, message: Message, **kwargs: Any) -> None:
        # message = nlp(message)
        print("********************")
        print(message)
        print(message.get("spacy_doc"))
        print(message.text)
        # at inference time the raw text is passed in instead of the spacy_doc;
        # tokenize() casts its argument to str, so both work
        message.set("tokens", self.tokenize(message.text))

    def tokenize(self, doc):
        doc = str(doc)
        # strip sentence punctuation and split on whitespace
        words = re.sub(r"[.,!?]+(\s|$)", " ", doc).split()
        print(type(doc))
        # drop stop words before lemmatizing
        toq = [tok for tok in words if tok not in stop_words]
        doc1 = nlp(" ".join(toq))
        words = [str(lemm.lemma_) for lemm in doc1]
        # remove remaining punctuation and non-ASCII characters
        words = [
            re.sub(
                r"[^\x00-\x7f]",
                "",
                re.sub(r"[\t\r\n,)([\]!%|!#$%&*+,.-/:;<=>?@^_`{|}~?]", "", str(i)),
            ).strip()
            for i in words
        ]
        tokens = []
        texts = " ".join(words)
        running_offset = 0
        print(words)
        for word in words:
            # note: offsets are computed against the cleaned, lemmatized text,
            # not against the original message text
            word_offset = texts.index(word, running_offset)
            word_len = len(word)
            running_offset = word_offset + word_len
            tokens.append(Token(word, word_offset))
        print(tokens)
        return tokens
```
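For reference, here is a minimal standalone sketch of what this tokenize logic produces for the two inputs tested below. It assumes the same spaCy "en" model link as above; the lemma_tokens helper and the four-word stop list are just illustrative stand-ins, not part of the actual component:

```python
import re

import spacy

nlp = spacy.load("en")  # same model link as in the component above
stop_words = ["the", "a", "is", "are"]  # abbreviated stand-in for the full list

def lemma_tokens(text):
    # strip trailing punctuation, drop stop words, then lemmatize what is left
    words = re.sub(r"[.,!?]+(\s|$)", " ", text).split()
    kept = [w for w in words if w not in stop_words]
    return [tok.lemma_ for tok in nlp(" ".join(kept))]

print(lemma_tokens("owner"))   # ['owner']
print(lemma_tokens("owners"))  # ['owner'] -- both inputs yield the same token
```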

My nlu.md:

And when I debug it in rasa shell nlu, I get the following results:

Case 1 -> user input -> "owner"

Shell output -> debugged log: ['owner']

```json
{
  "intent": {
    "name": "owner",
    "confidence": 0.9947196496583003
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9947196496583003 },
    { "name": "out_of_scope", "confidence": 0.001791049208563651 },
    { "name": "thank_you", "confidence": 0.001411993969675 },
    { "name": "greet", "confidence": 0.0007976021964830285 },
    { "name": "inform", "confidence": 0.00047270412751516944 },
    { "name": "person_enquiry", "confidence": 0.0004340739228840122 },
    { "name": "client_info", "confidence": 0.00019387738146468518 },
    { "name": "project_usecase", "confidence": 0.00017904953511428266 }
  ],
  "text": "owner"
}
```

Case 2 -> user input -> "owners"

Debugged log -> ['owner']

```json
{
  "intent": {
    "name": "owner",
    "confidence": 0.9253092600222103
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9253092600222103 },
    { "name": "client_info", "confidence": 0.057077033004912105 },
    { "name": "inform", "confidence": 0.007490610202169739 },
    { "name": "greet", "confidence": 0.003935998500918612 },
    { "name": "out_of_scope", "confidence": 0.0026231646658204386 },
    { "name": "thank_you", "confidence": 0.0016098162847478703 },
    { "name": "person_enquiry", "confidence": 0.0010423624130542462 },
    { "name": "project_usecase", "confidence": 0.000911754906166398 }
  ],
  "text": "owners"
}
```

As you can see, the confidence score for "owner" is 0.9947196496583003, while the confidence score for "owners" is 0.9253092600222103.

Why is there a difference in the confidence scores? Am I proceeding correctly, or is there anything that needs to be changed in the code? Can someone comment on this?

And my pipeline is:

```yaml
language: "en"

pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "SklearnIntentClassifier"
```

Hi @Vighnesh, "owner" and "owners" are two different words, so that's expected - I don't think your custom component has anything to do with it. The SpacyFeaturizer builds its features from the spacy_doc of the raw message text, not from your tokens, so the intent classifier still sees two different inputs.
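To see this concretely, you can compare the spaCy vectors of the two raw inputs. A quick sketch, assuming the same "en" model link used in the post (doc.vector and doc.similarity are standard spaCy APIs):

```python
import spacy

nlp = spacy.load("en")  # same model link as in the modified tokenizer

doc_owner = nlp("owner")
doc_owners = nlp("owners")

# The featurizer works from the spacy_doc of the raw text, so the
# classifier receives a different feature vector for each input.
print((doc_owner.vector == doc_owners.vector).all())  # False: vectors differ
print(doc_owner.similarity(doc_owners))  # similar, but not identical
```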