Understanding spacy_doc in SpacyTokenizer class

Hi all,

I am trying to understand the NLU pipeline in detail. The SpacyTokenizer class requires spacy_doc, but I can’t work out where spacy_doc is defined in the pipeline or what it provides. Can anyone help? Thanks.


I ran into the same problem. Did you fix it?

No I didn’t.


I don’t know too much about this either, but maybe I can point you in the right direction.

To tokenize a sentence (split it into its parts) we can use the SpacyTokenizer. Since the SpacyTokenizer depends on the spaCy module and a language model (en_core_web_sm or similar), we need to be sure these are available before it runs. Hence the requirement spacy_doc. There is a SpacyNLP component that loads the model/language for us and provides what the tokenizer needs. It is defined here in the spacy_utils file (see: provides = ["spacy_doc", "spacy_nlp"]).
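To make the provides/requires idea concrete, here is a minimal pure-Python sketch. It is not the actual Rasa code: Message, FakeSpacyNLP, FakeSpacyTokenizer, validate and run are hypothetical names made up for illustration, the "tokens" entry in provides is my assumption, and a plain whitespace split stands in for the real spaCy model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Message:
    """Hypothetical stand-in for the message passed through the pipeline."""
    text: str
    data: Dict[str, Any] = field(default_factory=dict)


class FakeSpacyNLP:
    """Stand-in for SpacyNLP. The real component runs a spaCy model;
    here a whitespace split plays the role of the Doc (assumption)."""
    provides = ["spacy_doc", "spacy_nlp"]
    requires: List[str] = []

    def process(self, message: Message) -> None:
        # Only spacy_doc is stored in this sketch; spacy_nlp is left out.
        message.data["spacy_doc"] = message.text.split()


class FakeSpacyTokenizer:
    """Stand-in for SpacyTokenizer: it only reads what was stored before."""
    provides = ["tokens"]  # assumed name, for illustration only
    requires = ["spacy_doc"]

    def process(self, message: Message) -> None:
        message.data["tokens"] = list(message.data["spacy_doc"])


def validate(pipeline: list) -> None:
    """Check that every requirement is provided by an earlier component."""
    available = set()
    for component in pipeline:
        missing = [r for r in component.requires if r not in available]
        if missing:
            name = type(component).__name__
            raise ValueError(f"{name} is missing {missing}")
        available.update(component.provides)


def run(pipeline: list, text: str) -> Message:
    """Validate the pipeline, then let each component process the message."""
    validate(pipeline)
    message = Message(text)
    for component in pipeline:
        component.process(message)
    return message
```

With this sketch, run([FakeSpacyNLP(), FakeSpacyTokenizer()], "hello big world") fills in the tokens, while a pipeline containing only FakeSpacyTokenizer raises a ValueError because nothing before it provides spacy_doc. That is the same reason the SpacyNLP component has to come before the SpacyTokenizer in the pipeline.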

A Doc is a container class that holds many pieces of information, including the tokens. To read more about the Doc object, see the spaCy documentation here.

Best regards, Thomas