How can I specify a user dictionary for spaCy's tokenizer in a non-English language? As you know, tokenizer quality affects the whole NLP pipeline, including entity extraction.
For example, how would we specify one for Chinese?
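To illustrate, outside of Rasa I can update the dictionary in plain spaCy like this (a sketch assuming spaCy v3 and a Chinese pipeline whose word segmenter is pkuseg, e.g. zh_core_web_md; the dictionary words are placeholders):

import spacy

# Load a Chinese pipeline; its word segmenter is pkuseg.
nlp = spacy.load("zh_core_web_md")

# Add domain terms so they are kept as single tokens.
nlp.tokenizer.pkuseg_update_user_dict(["自然语言处理", "知识图谱"])

print([t.text for t in nlp("我在学习自然语言处理")])

But I don't see where to hook this into a Rasa pipeline.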
Hi, do you have any solutions?
I want to do this also.
I am trying to rewrite the SpacyTokenizer in rasa.nlu.tokenizers. However, I cannot find where Rasa uses SpacyNLP, and I am confused about what the incoming message contains. Here is the code:
from typing import List, Optional, Text

from rasa.nlu.constants import SPACY_DOCS
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.training_data.message import Message

POS_TAG_KEY = "pos"


class SpacyTokenizer(Tokenizer):
    # "Doc" is spacy.tokens.doc.Doc, imported under TYPE_CHECKING in Rasa's source.
    def get_doc(self, message: Message, attribute: Text) -> Optional["Doc"]:
        return message.get(SPACY_DOCS[attribute])

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        doc = self.get_doc(message, attribute)
        # Debug output I added to inspect the spaCy doc:
        print('doc: ')
        print(doc)
        if not doc:
            return []
        tokens = [
            Token(
                t.text, t.idx, lemma=t.lemma_, data={POS_TAG_KEY: self._tag_of_token(t)}
            )
            for t in doc
            if t.text and t.text.strip()
        ]
        return self._apply_token_pattern(tokens)
Because the spaCy language model is not loaded inside the tokenizer, you should not modify SpacyTokenizer to customize the user dictionary.
The model is actually loaded by the SpacyNLP component, so consider inheriting from rasa.nlu.utils.spacy_utils.SpacyNLP and overriding its load_model static method, which returns a SpacyModel.
The reference code is as follows:
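This is a minimal sketch, assuming Rasa 3.x (where SpacyNLP.load_model is a static method returning a SpacyModel) and a Chinese spaCy pipeline whose segmenter is pkuseg, such as zh_core_web_md. CustomSpacyNLP and the words in USER_DICT are placeholder names:

from typing import Text

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.nlu.utils.spacy_utils import SpacyModel, SpacyNLP

# Placeholder domain terms that should be kept as single tokens.
USER_DICT = ["自然语言处理", "知识图谱"]


# Registration mirrors the stock SpacyNLP component.
@DefaultV1Recipe.register(
    DefaultV1Recipe.ComponentType.MODEL_LOADER, is_trainable=False
)
class CustomSpacyNLP(SpacyNLP):
    """A SpacyNLP variant that injects a user dictionary after loading."""

    @staticmethod
    def load_model(spacy_model_name: Text) -> SpacyModel:
        # Reuse the stock loader, then extend the segmenter's dictionary.
        spacy_model = SpacyNLP.load_model(spacy_model_name)
        # pkuseg_update_user_dict is only available when the Chinese
        # tokenizer uses the pkuseg segmenter.
        spacy_model.model.tokenizer.pkuseg_update_user_dict(USER_DICT)
        return spacy_model

Then reference the subclass in config.yml by its module path instead of SpacyNLP, e.g. "custom_spacy_nlp.CustomSpacyNLP" (the module name depends on where you save the file), keeping the usual "model: zh_core_web_md" option. Downstream, SpacyTokenizer will receive docs segmented with your dictionary, which should help entity extraction on those terms.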