Understanding spacy_doc in SpacyTokenizer class

Rajskc · January 12, 2019, 10:20am

Hi all,

I am trying to understand the NLU pipeline at a detailed level. The SpacyTokenizer class has a requirement for spacy_doc, but I can't work out where spacy_doc is defined in the pipeline or what it provides. Can anyone help me? Thanks.

alexweek · May 14, 2019, 12:22pm

I ran into the same problem. Did you fix it?

Rajskc · May 21, 2019, 7:53am

No, I didn’t.

tomtomtom · May 21, 2019, 10:31am


I don’t know too much either, but maybe I can point you in the right direction.

To tokenize a sentence (split it into its parts) we can use the spacy_tokenizer. Since the spacy_tokenizer requires the spaCy module and a language model (en_core_web_sm or similar), we need to be sure that these are available to us, hence the requirement spacy_doc. There is a SpacyNLP component which loads the model/language for us. It is defined here in the spacy_utils file (see: provides = ["spacy_doc", "spacy_nlp"]).

github.com/RasaHQ/rasa/blob/main/rasa/nlu/utils/spacy_utils.py#L17

from rasa.engine.graph import ExecutionContext, GraphComponent
from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.storage.storage import ModelStorage
from rasa.nlu.constants import DENSE_FEATURIZABLE_ATTRIBUTES, SPACY_DOCS
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.nlu.model import InvalidModelError
from rasa.shared.constants import DOCS_URL_COMPONENTS


logger = logging.getLogger(__name__)


if typing.TYPE_CHECKING:
    from spacy.language import Language  # noqa: F401
    from spacy.tokens import Doc


@dataclasses.dataclass
class SpacyModel:
    """Wraps `SpacyNLP` output to make it fingerprintable."""

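To make the provides/requires idea a bit more concrete, here is a minimal toy sketch. It is not Rasa's actual code: ToyMessage, ToyNLP, ToyTokenizer and run_pipeline are hypothetical names I made up, and str.split() only stands in for what spaCy would really do. It just shows why the tokenizer has to come after the component that provides spacy_doc.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyMessage:
    """Hypothetical stand-in for a message: the text plus attached data."""

    text: str
    data: Dict[str, Any] = field(default_factory=dict)


class ToyNLP:
    """Plays the role of SpacyNLP: runs first and attaches a 'doc'."""

    provides = ["spacy_doc"]
    requires: List[str] = []

    def process(self, message: ToyMessage) -> None:
        # Real spaCy would build a Doc here; str.split() is only a placeholder.
        message.data["spacy_doc"] = message.text.split()


class ToyTokenizer:
    """Plays the role of SpacyTokenizer: reads the doc an earlier step attached."""

    provides = ["tokens"]
    requires = ["spacy_doc"]

    def process(self, message: ToyMessage) -> None:
        message.data["tokens"] = list(message.data["spacy_doc"])


def run_pipeline(components: List[Any], message: ToyMessage) -> ToyMessage:
    """Run components in order, checking each one's requirements first."""
    available = set()
    for component in components:
        missing = [r for r in component.requires if r not in available]
        if missing:
            raise ValueError(f"{type(component).__name__} is missing {missing}")
        component.process(message)
        available.update(component.provides)
    return message
```

Running run_pipeline([ToyNLP(), ToyTokenizer()], ToyMessage("hello rasa world")) fills in the tokens, while putting ToyTokenizer first (or leaving ToyNLP out) raises a ValueError, which is roughly the situation the spacy_doc requirement guards against.
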
A Doc consists of many pieces, including the tokens (it is a container class). To read more about the Doc object, see the spaCy documentation.
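As a rough illustration of the "container of tokens" idea, the sketch below pulls the text and character offsets out of each token in a Doc. The helper names doc_to_spans and demo_with_spacy are my own, not part of spaCy or Rasa. It relies only on token.text and token.idx, which spaCy tokens expose, and it uses spacy.blank("en") so that no language model has to be downloaded.

```python
from typing import Iterable, List, Tuple


def doc_to_spans(doc: Iterable) -> List[Tuple[str, int, int]]:
    """Return (text, start, end) for every token in a spaCy-like Doc.

    Iterating over a Doc yields its tokens; `idx` is the character offset
    of the token within the original text.
    """
    return [(token.text, token.idx, token.idx + len(token.text)) for token in doc]


def demo_with_spacy() -> List[Tuple[str, int, int]]:
    """Needs spaCy installed; a blank pipeline only contains the tokenizer."""
    import spacy

    nlp = spacy.blank("en")  # no trained components, so no model download
    doc = nlp("Book a table for two")  # calling nlp on text returns a Doc
    return doc_to_spans(doc)
```

If spaCy is installed, calling demo_with_spacy() returns one tuple per token. A tokenizer component that receives a ready-made Doc presumably does something similar when it turns the Doc's tokens into its own token objects, but that part is my assumption rather than something taken from the Rasa source.
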

Best regards, Thomas
