Hi, I modified the spacy_tokenizer.py file to lemmatize the user inputs and to remove stop words.

My file:

```python
import re
import typing
from typing import Any

import spacy

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.constants import (
    MESSAGE_RESPONSE_ATTRIBUTE,
    MESSAGE_INTENT_ATTRIBUTE,
    MESSAGE_TEXT_ATTRIBUTE,
    MESSAGE_TOKENS_NAMES,
    MESSAGE_ATTRIBUTES,
    MESSAGE_SPACY_FEATURES_NAMES,
    MESSAGE_VECTOR_FEATURE_NAMES,
)

if typing.TYPE_CHECKING:
    from spacy.tokens.doc import Doc  # pytype: disable=import-error

nlp = spacy.load("en")

stop_words = [
    "ours", "keep", "in", "enough", "anything", "latterly", "thereupon", "your", "if", "as",
    "each", "his", "but", "everywhere", "hereupon", "being", "becoming", "and", "anyhow", "serious",
    "something", "latter", "namely", "name", "seemed", "yourselves", "toward", "must", "same", "then",
    "become", "while", "becomes", "ourselves", "perhaps", "or", "more", "whose", "along", "own",
    "thence", "had", "itself", "top", "whether", "beside", "into", "on", "per", "whole",
    "one", "towards", "himself", "against", "beyond", "off", "done", "are", "you", "he",
    "yours", "an", "myself", "themselves", "hereafter", "else", "have", "neither", "again", "afterwards",
    "under", "its", "due", "always", "be", "over", "therefore", "very", "at", "during",
    "nobody", "where", "whoever", "across", "thereafter", "i", "thereby", "empty", "move", "put",
    "through", "since", "my", "wherein", "became", "thus", "none", "cannot", "did", "next",
    "above", "regarding", "to", "too", "within", "just", "nothing", "now", "am", "part",
    "seems", "than", "alone", "after", "once", "doing", "otherwise", "who", "indeed", "full",
    "whence", "before", "how", "although", "mostly", "take", "between", "these", "whereas", "former",
    "whom", "many", "amongst", "other", "ca", "besides", "go", "much", "may", "nowhere",
    "together", "him", "her", "there", "say", "throughout", "whereby", "mine", "formerly", "only",
    "really", "herein", "show", "might", "hers", "often", "when", "whereupon", "those", "rather",
    "somewhere", "give", "here", "do", "used", "does", "me", "seem", "unless", "sometime",
    "almost", "via", "back", "hereby", "few", "all", "up", "using", "should", "well",
    "see", "been", "various", "yourself", "bottom", "onto", "side", "for", "everyone", "will",
    "several", "however", "meanwhile", "can", "everything", "around", "she", "of", "their", "were",
    "get", "until", "that", "yet", "already", "both", "by", "somehow", "any", "please",
    "whereafter", "behind", "therein", "the", "they", "whenever", "out", "still", "our", "most",
    "least", "though", "with", "a", "could", "such", "less", "was", "nor", "others",
    "why", "about", "never", "so", "us", "wherever", "beforehand", "moreover", "last", "among",
    "elsewhere", "nevertheless", "quite", "upon", "ever", "anywhere", "we", "down", "what", "amount",
    "whither", "it", "below", "someone", "either", "is", "some", "even", "also", "from",
    "except", "further", "herself", "make", "which", "this", "call", "without", "made", "re",
    "sometimes", "another", "whatever", "anyone", "would", "every", "thru", "them", "anyway", "hence",
    "has", "because", "seeming", "what's", "whats", "-PRON-", "iam", "im", "i'm", "what's",
    "whats", "am",
]


class SpacyTokenizer(Tokenizer, Component):

    name = "tokenizer_spacy_lemma"

    provides = ["tokens"]

    requires = ["spacy_doc"]

    def train(
        self,
        training_data: TrainingData,
        config: RasaNLUModelConfig,
        **kwargs: Any,
    ) -> None:
        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.get("spacy_doc")))

    def process(self, message: Message, **kwargs: Any) -> None:
        print("********************")
        print(message.get("spacy_doc"))
        print(message.text)
        message.set("tokens", self.tokenize(message.text))

    def tokenize(self, doc):
        text = str(doc)
        # drop sentence punctuation and split on whitespace
        words = re.sub(r"[.,!?]+(\s|$)", " ", text).split()
        # remove stop words, then lemmatize what is left
        kept = [tok for tok in words if tok not in stop_words]
        lemmas = [str(token.lemma_) for token in nlp(" ".join(kept))]
        # strip remaining punctuation and non-ASCII characters from each lemma
        words = [
            re.sub(
                r"[^\x00-\x7f]",
                "",
                re.sub(r"[\t\r\n,)([\]!%|!#$%&*+,.\-/:;<=>?@^_`{|}~?]", "", lemma),
            ).strip()
            for lemma in lemmas
        ]
        print(words)  # debug: final lemmatized tokens

        # rebuild Token objects with character offsets into the cleaned text
        tokens = []
        texts = " ".join(words)
        running_offset = 0
        for word in words:
            word_offset = texts.index(word, running_offset)
            running_offset = word_offset + len(word)
            tokens.append(Token(word, word_offset))
        return tokens
```
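To sanity-check the filter-and-reindex logic independently of spaCy and Rasa, the same steps (strip trailing punctuation, drop stop words, recompute character offsets) can be exercised with the standard library alone. This is a minimal sketch with a tiny hypothetical stop list, not the full list above, and plain tuples in place of Rasa's `Token`:

```python
import re

# hypothetical miniature stop list for illustration only
stop_words = {"who", "is", "the", "for", "of"}

def tokenize(text):
    # drop sentence punctuation and split on whitespace
    words = re.sub(r"[.,!?]+(\s|$)", " ", text).split()
    kept = [w for w in words if w not in stop_words]
    # recompute offsets against the rebuilt (filtered) text
    joined = " ".join(kept)
    tokens, running_offset = [], 0
    for word in kept:
        offset = joined.index(word, running_offset)
        running_offset = offset + len(word)
        tokens.append((word, offset))
    return tokens

print(tokenize("who is the owner for pv first?"))
# → [('owner', 0), ('pv', 6), ('first', 9)]
```

Note that the offsets point into the rebuilt, filtered string, not into the original message text, which is worth keeping in mind if any downstream component maps tokens back to the raw input.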
My nlu.md:
- who is the owner for pv first
- leader of pv first
- who owns pv first
- who controls pv first
- who oreders pv first
- who is the owner for alsc
- leader of alsc
- who owns alsc
- who controls alsc
- who oreders alsc
- who is the owner for ucr
- leader of ucr
- who owns ucr
- who controls ucr
- who oreders ucr
- who is the owner for arw
- leader of arw
- who owns arw
- who controls arw
- who oreders arw
- who is the owner for coip
- leader of coip
- who owns coip
- who owns coip
- who owns coip
- who owns coip
- who controls coip
- who oreders coip
- who is the owner for cdisc
- leader of cdisc
- who owns cdisc
- who controls cdisc
- who oreders cdisc
And when I debug it in rasa shell nlu, I got the following results.
Case 1 -> user input: "owner"

Shell output -> debugged log: ['owner']

```
{
  "intent": { "name": "owner", "confidence": 0.9947196496583003 },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9947196496583003 },
    { "name": "out_of_scope", "confidence": 0.001791049208563651 },
    { "name": "thank_you", "confidence": 0.001411993969675 },
    { "name": "greet", "confidence": 0.0007976021964830285 },
    { "name": "inform", "confidence": 0.00047270412751516944 },
    { "name": "person_enquiry", "confidence": 0.0004340739228840122 },
    { "name": "client_info", "confidence": 0.00019387738146468518 },
    { "name": "project_usecase", "confidence": 0.00017904953511428266 }
  ],
  "text": "owner"
}
```
Case 2 -> user input: "owners"

Debugged log: ['owner']

```
{
  "intent": { "name": "owner", "confidence": 0.9253092600222103 },
  "entities": [],
  "intent_ranking": [
    { "name": "owner", "confidence": 0.9253092600222103 },
    { "name": "client_info", "confidence": 0.057077033004912105 },
    { "name": "inform", "confidence": 0.007490610202169739 },
    { "name": "greet", "confidence": 0.003935998500918612 },
    { "name": "out_of_scope", "confidence": 0.0026231646658204386 },
    { "name": "thank_you", "confidence": 0.0016098162847478703 },
    { "name": "person_enquiry", "confidence": 0.0010423624130542462 },
    { "name": "project_usecase", "confidence": 0.000911754906166398 }
  ],
  "text": "owners"
}
```
As you can see, the confidence score for "owner" is 0.9947196496583003, while the confidence score for "owners" is 0.9253092600222103, even though both are lemmatized to the same token.

Why is there a difference in confidence score? Am I proceeding correctly, or is there anything that needs to be changed in the code? Can someone comment on this?
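One plausible explanation (an assumption about the pipeline internals, not something visible in the logs above): `SklearnIntentClassifier` produces calibrated probability estimates from the dense features, and `SpacyFeaturizer` builds those features from the spaCy doc of the raw message text, so "owner" and "owners" can still reach the classifier as slightly different vectors even when the tokens printed by the tokenizer are identical. Slightly different decision scores then yield different confidences while the winning intent stays the same, as this toy softmax sketch (with made-up scores) illustrates:

```python
import math

def softmax(scores):
    # turn raw decision scores into probabilities that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical decision scores for two nearly identical inputs
conf_owner = softmax([5.0, 1.0, 0.5])
conf_owners = softmax([4.6, 1.2, 0.5])

# both inputs pick class 0, but with different confidence values
print(conf_owner[0], conf_owners[0])
```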
And my pipeline is:

```yaml
language: "en"
pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "SklearnIntentClassifier"
```
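One thing worth double-checking: the pipeline lists the stock `SpacyTokenizer`, while the modified component registers itself as `tokenizer_spacy_lemma`. If the intent is for the custom tokenizer to run, Rasa 1.x expects a custom component to be referenced by its module path. A sketch, assuming the modified file is on the Python path as `spacy_tokenizer.py`:

```yaml
language: "en"
pipeline:
  - name: "SpacyNLP"
  # module path to the modified tokenizer class (assumed file location)
  - name: "spacy_tokenizer.SpacyTokenizer"
  - name: "SpacyFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "SklearnIntentClassifier"
```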
@TatianaParshina @Ghostvv - Could you please comment on this?