Questions about Rasa with spaCy

  1. How do I specify a user dictionary for spaCy for a non-English language? Tokenizer quality affects the whole NLP pipeline, including entity extraction. The example below sets one for Chinese.
  2. How do I specify custom entities in spaCy for Rasa?
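    import spacy

    nlp = spacy.load('zh_core_web_sm')
    # Add custom words to the pkuseg segmenter so they stay single tokens.
    nlp.tokenizer.pkuseg_update_user_dict(['yyds', 'cx-4'])
    # Match the custom words as entities with a rule-based matcher.
    ruler = nlp.add_pipe("entity_ruler")
    patterns = [
        {"label": "net_hot_word", "pattern": "yyds"},
        {"label": "car_name", "pattern": "cx-4"},
    ]
    ruler.add_patterns(patterns)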

Hi, did you find a solution? I want to do this too. I am trying to rewrite the SpacyTokenizer in rasa.nlu.tokenizers, but I can't find how Rasa uses SpacyNLP, and I am confused about the incoming message.

class SpacyTokenizer(Tokenizer):
    def get_doc(self, message: Message, attribute: Text) -> Optional["Doc"]:
        return message.get(SPACY_DOCS[attribute])

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        doc = self.get_doc(message, attribute)
        print('doc: ')
        if not doc:
            return []

        tokens = [
            Token(
                t.text, t.idx, lemma=t.lemma_, data={POS_TAG_KEY: self._tag_of_token(t)}
            )
            for t in doc
            if t.text and t.text.strip()
        ]

        return self._apply_token_pattern(tokens)

I don't know how this loads the spaCy NLP model.

The code that loads the spaCy language model is not in the tokenizer, so modifying SpacyTokenizer is not the right way to customize the user dictionary.

In fact, the language model is loaded in the SpacyNLP component, so you can subclass rasa.nlu.utils.spacy_utils.SpacyNLP and override its load_model static method, which returns a SpacyModel. The reference code is as follows:

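from typing import Text
from rasa.nlu.utils.spacy_utils import SpacyModel, SpacyNLP

class KefuSpacyNlp(SpacyNLP):
    @staticmethod  # load_model is static on SpacyNLP, so the override must be too
    def load_model(spacy_model_name: Text) -> SpacyModel:
        _spacy_model = SpacyNLP.load_model(spacy_model_name)
        # Extend the pkuseg user dictionary on the loaded spaCy pipeline.
        _spacy_model.model.tokenizer.pkuseg_update_user_dict(["word1", "word2", ...])
        return _spacy_model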
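
The override pattern in the reference code can be tried out without Rasa installed. The sketch below uses hypothetical stand-in classes (`FakeModel`, `BaseNLP`, `CustomNLP`), not Rasa's real ones. It only illustrates two things: the loader is a static method, so the override should be declared with `@staticmethod` as well, and the subclass can delegate to the parent loader and then patch the returned model before handing it on.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeModel:
    """Hypothetical stand-in for the loaded language model wrapper."""
    name: str
    user_words: List[str] = field(default_factory=list)


class BaseNLP:
    """Hypothetical stand-in for the component that owns model loading."""

    @staticmethod
    def load_model(model_name: str) -> FakeModel:
        return FakeModel(name=model_name)


class CustomNLP(BaseNLP):
    @staticmethod
    def load_model(model_name: str) -> FakeModel:
        # Delegate to the parent loader, then customize the result.
        model = BaseNLP.load_model(model_name)
        model.user_words.extend(["word1", "word2"])
        return model


model = CustomNLP.load_model("zh_core_web_sm")
print(model.user_words)
```

Because the method is static, it behaves the same whether it is called on the class or on an instance. Without the decorator, an instance call would pass the instance itself as the first argument.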