I am trying to understand the NLU pipeline at a detailed level. The SpacyTokenizer class has a requirement for spacy_doc, but I can't work out where spacy_doc is defined in the pipeline or what it provides. Can anyone help me? Thanks.
I don’t know too much either, but I can maybe point you in the right direction.
To tokenize a sentence (split it into its parts) we can use the spacy_tokenizer. Since the spacy_tokenizer needs the spaCy module and a language model (en_core_web_sm or similar), we have to make sure these are available to us. Hence the requirement for spacy_doc.
There is a SpacyNLP component which loads the model/language for us.
It is loaded here in the spacy_utils file (see: provides = ["spacy_doc", "spacy_nlp"]).
A Doc is a container class made up of many pieces, including the tokens. To read more about the Doc object, you can go to the spaCy documentation here.
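To make the provides/requires idea a bit more concrete, here is a rough sketch in plain Python. It is not the actual Rasa code: Message, Component, FakeSpacyNLP, FakeSpacyTokenizer, validate and run are all made-up names for illustration, and a simple list of words stands in for the real spaCy Doc. It only shows the general pattern, where an earlier component provides "spacy_doc" and a later one that requires it can then read it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Message:
    """Hypothetical stand-in for the object handed from component to component."""
    text: str
    data: Dict[str, Any] = field(default_factory=dict)


class Component:
    """Hypothetical base class: each component declares what it provides/requires."""
    provides: List[str] = []
    requires: List[str] = []

    def process(self, message: Message) -> None:
        raise NotImplementedError


class FakeSpacyNLP(Component):
    # Mirrors the declaration quoted above; the body is an assumption.
    provides = ["spacy_doc", "spacy_nlp"]

    def process(self, message: Message) -> None:
        # A real spaCy Doc is far richer; a list of words stands in for it here.
        message.data["spacy_doc"] = message.text.split()


class FakeSpacyTokenizer(Component):
    provides = ["tokens"]
    requires = ["spacy_doc"]

    def process(self, message: Message) -> None:
        # Reads the doc that an earlier component attached to the message.
        message.data["tokens"] = list(message.data["spacy_doc"])


def validate(pipeline: List[Component]) -> None:
    """Check that every requirement is provided by an earlier component."""
    available = set()
    for component in pipeline:
        missing = set(component.requires) - available
        if missing:
            name = type(component).__name__
            raise ValueError(f"{name} is missing {sorted(missing)}")
        available.update(component.provides)


def run(pipeline: List[Component], text: str) -> Message:
    """Validate the pipeline order, then pass one message through it."""
    validate(pipeline)
    message = Message(text)
    for component in pipeline:
        component.process(message)
    return message
```

With this sketch, run([FakeSpacyNLP(), FakeSpacyTokenizer()], "hello big world") fills in the tokens, while a pipeline containing only FakeSpacyTokenizer fails validation because nothing before it provides spacy_doc. That is the same kind of error you would expect if the tokenizer is listed without the spaCy loader in front of it.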