Able to lemmatize by modifying spacy_tokenizer, but the output confidence differs for the same stem word

Hi, I modified the spacy_tokenizer.py file to lemmatize the user inputs and to remove stop words. Here is the file:

```python
import typing
from typing import Any

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData

from rasa.nlu.constants import (
    MESSAGE_RESPONSE_ATTRIBUTE,
    MESSAGE_INTENT_ATTRIBUTE,
    MESSAGE_TEXT_ATTRIBUTE,
    MESSAGE_TOKENS_NAMES,
    MESSAGE_ATTRIBUTES,
    MESSAGE_SPACY_FEATURES_NAMES,
    MESSAGE_VECTOR_FEATURE_NAMES,
)

if typing.TYPE_CHECKING:
    from spacy.tokens.doc import Doc  # pytype: disable=import-error

import re

import spacy

nlp = spacy.load("en")

stop_words = [
    "ours", "keep", "in", "enough", "anything", "latterly", "thereupon", "your",
    "if", "as", "each", "his", "but", "everywhere", "hereupon", "being", "becoming",
    "and", "anyhow", "serious", "something", "latter", "namely", "name", "seemed",
    "yourselves", "toward", "must", "same", "then", "become", "while", "becomes",
    "ourselves", "perhaps", "or", "more", "whose", "along", "own", "thence", "had",
    "itself", "top", "whether", "beside", "into", "on", "per", "whole", "one",
    "towards", "himself", "against", "beyond", "off", "done", "are", "you", "he",
    "yours", "an", "myself", "themselves", "hereafter", "else", "have", "neither",
    "again", "afterwards", "under", "its", "due", "always", "be", "over",
    "therefore", "very", "at", "during", "nobody", "where", "whoever", "across",
    "thereafter", "i", "thereby", "empty", "move", "put", "through", "since", "my",
    "wherein", "became", "thus", "none", "cannot", "did", "next", "above",
    "regarding", "to", "too", "within", "just", "nothing", "now", "am", "part",
    "seems", "than", "alone", "after", "once", "doing", "otherwise", "who",
    "indeed", "full", "whence", "before", "how", "although", "mostly", "take",
    "between", "these", "whereas", "former", "whom", "many", "amongst", "other",
    "ca", "besides", "go", "much", "may", "nowhere", "together", "him", "her",
    "there", "say", "throughout", "whereby", "mine", "formerly", "only", "really",
    "herein", "show", "might", "hers", "often", "when", "whereupon", "those",
    "rather", "somewhere", "give", "here", "do", "used", "does", "me", "seem",
    "unless", "sometime", "almost", "via", "back", "hereby", "few", "all", "up",
    "using", "should", "well", "see", "been", "various", "yourself", "bottom",
    "onto", "side", "for", "everyone", "will", "several", "however", "meanwhile",
    "can", "everything", "around", "she", "of", "their", "were", "get", "until",
    "that", "yet", "already", "both", "by", "somehow", "any", "please",
    "whereafter", "behind", "therein", "the", "they", "whenever", "out", "still",
    "our", "most", "least", "though", "with", "a", "could", "such", "less", "was",
    "nor", "others", "why", "about", "never", "so", "us", "wherever", "beforehand",
    "moreover", "last", "among", "elsewhere", "nevertheless", "quite", "upon",
    "ever", "anywhere", "we", "down", "what", "amount", "whither", "it", "below",
    "someone", "either", "is", "some", "even", "also", "from", "except", "further",
    "herself", "make", "which", "this", "call", "without", "made", "re",
    "sometimes", "another", "whatever", "anyone", "would", "every", "thru", "them",
    "anyway", "hence", "has", "because", "seeming", "what's", "whats", "-PRON-",
    "iam", "im", "i'm", "what's", "whats", "am",
]


class SpacyTokenizer(Tokenizer, Component):

name = "tokenizer_spacy_lemma"
provides = ["tokens"]
requires = ["spacy_doc"]

    def train(self,
              training_data: TrainingData,
              config: RasaNLUModelConfig,
              **kwargs: Any) -> None:

        # at training time the spacy_doc (created by SpacyNLP) is handed to tokenize
        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.get("spacy_doc")))

    def process(self, message: Message, **kwargs: Any) -> None:
        # message = nlp(message)
        print("********************")
        print(message)
        print(message.get("spacy_doc"))
        print(message.text)
        # at inference time the raw text is passed in instead of the spacy_doc;
        # tokenize() casts its argument to str, so both work
        message.set("tokens", self.tokenize(message.text))

    def tokenize(self, doc):
        doc = str(doc)
        # strip sentence punctuation and split on whitespace
        words = re.sub(r"[.,!?]+(\s|$)", " ", doc).split()
        print(type(doc))
        # drop stop words before lemmatizing
        toq = [tok for tok in words if tok not in stop_words]
        doc1 = nlp(" ".join(toq))
        words = [str(lemm.lemma_) for lemm in doc1]
        # remove remaining punctuation and non-ASCII characters
        words = [
            re.sub(
                r"[^\x00-\x7f]",
                "",
                re.sub(r"[\t\r\n,)([\]!%|!#$%&*+,.-/:;<=>?@^_`{|}~?]", "", str(i)),
            ).strip()
            for i in words
        ]
        tokens = []
        texts = " ".join(words)
        running_offset = 0
        print(words)
        for word in words:
            # note: offsets are computed against the cleaned, lemmatized text,
            # not against the original message text
            word_offset = texts.index(word, running_offset)
            word_len = len(word)
            running_offset = word_offset + word_len
            tokens.append(Token(word, word_offset))
        print(tokens)
        return tokens
```
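For reference, here is a minimal standalone sketch of what this tokenize logic produces for the two inputs tested below. It assumes the same spaCy "en" model link as above; the lemma_tokens helper and the four-word stop list are just illustrative stand-ins, not part of the actual component:

```python
import re

import spacy

nlp = spacy.load("en")  # same model link as in the component above
stop_words = ["the", "a", "is", "are"]  # abbreviated stand-in for the full list

def lemma_tokens(text):
    # strip trailing punctuation, drop stop words, then lemmatize what is left
    words = re.sub(r"[.,!?]+(\s|$)", " ", text).split()
    kept = [w for w in words if w not in stop_words]
    return [tok.lemma_ for tok in nlp(" ".join(kept))]

print(lemma_tokens("owner"))   # ['owner']
print(lemma_tokens("owners"))  # ['owner'] -- both inputs yield the same token
```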

My nlu.md:

And when I debug it in rasa shell nlu, I get the following results:

Case 1 -> user input -> "owner"

Shell output -> debugged log: ['owner']

```json
{
  "intent": {
    "name": "owner",
    "confidence": 0.9947196496583003
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9947196496583003 },
    { "name": "out_of_scope", "confidence": 0.001791049208563651 },
    { "name": "thank_you", "confidence": 0.001411993969675 },
    { "name": "greet", "confidence": 0.0007976021964830285 },
    { "name": "inform", "confidence": 0.00047270412751516944 },
    { "name": "person_enquiry", "confidence": 0.0004340739228840122 },
    { "name": "client_info", "confidence": 0.00019387738146468518 },
    { "name": "project_usecase", "confidence": 0.00017904953511428266 }
  ],
  "text": "owner"
}
```

Case 2 -> user input -> "owners"

Debugged log -> ['owner']

```json
{
  "intent": {
    "name": "owner",
    "confidence": 0.9253092600222103
  },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9253092600222103 },
    { "name": "client_info", "confidence": 0.057077033004912105 },
    { "name": "inform", "confidence": 0.007490610202169739 },
    { "name": "greet", "confidence": 0.003935998500918612 },
    { "name": "out_of_scope", "confidence": 0.0026231646658204386 },
    { "name": "thank_you", "confidence": 0.0016098162847478703 },
    { "name": "person_enquiry", "confidence": 0.0010423624130542462 },
    { "name": "project_usecase", "confidence": 0.000911754906166398 }
  ],
  "text": "owners"
}
```

As you can see, the confidence score for "owner" is 0.9947196496583003, while the confidence score for "owners" is 0.9253092600222103.

Why is there a difference in the confidence scores? Am I proceeding correctly, or is there anything that needs to be changed in the code? Can someone comment on this?

And my pipeline is:

```yaml
language: "en"

pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "SklearnIntentClassifier"
```

Hi @Vighnesh, "owner" and "owners" are two different words, so that's expected - I don't think your custom component has anything to do with it. The SpacyFeaturizer builds its features from the spacy_doc of the raw message text, not from your tokens, so the intent classifier still sees two different inputs.
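To see this concretely, you can compare the spaCy vectors of the two raw inputs. A quick sketch, assuming the same "en" model link used in the post (doc.vector and doc.similarity are standard spaCy APIs):

```python
import spacy

nlp = spacy.load("en")  # same model link as in the modified tokenizer

doc_owner = nlp("owner")
doc_owners = nlp("owners")

# The featurizer works from the spacy_doc of the raw text, so the
# classifier receives a different feature vector for each input.
print((doc_owner.vector == doc_owners.vector).all())  # False: vectors differ
print(doc_owner.similarity(doc_owners))  # similar, but not identical
```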