Hi @JulianGerhard,
This was actually a tricky case in the implementation. A non-existent attribute was given an empty string because spaCy doesn’t accept None as the input string. Since we process the spaCy docs for training examples in batch mode with self.nlp.pipe(texts, batch_size=50), which is much faster, I replaced all None attributes with empty strings. It would be tedious and messy to filter out the examples with a None value for an attribute and then merge an empty doc back in for them.
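To make the idea concrete, here is a rough sketch of the None-to-empty-string substitution. It is not the actual Rasa code: fake_pipe is a hypothetical stand-in for self.nlp.pipe so the snippet runs without spaCy, and raw_texts and has_attribute are made-up names for illustration.

```python
from typing import Iterable, List, Optional


def fake_pipe(texts: Iterable[str]) -> Iterable[List[str]]:
    # Hypothetical stand-in for self.nlp.pipe(texts, batch_size=50):
    # yields one "doc" (here just a token list) per text, in input order.
    for text in texts:
        yield text.split()


raw_texts: List[Optional[str]] = ["hello there", None, "book a table"]

# Replace None with '' so every example keeps its slot in the batch.
texts = [text if text is not None else "" for text in raw_texts]
docs = list(fake_pipe(texts))

# A doc of length 0 marks an example whose attribute was missing.
has_attribute = [len(doc) > 0 for doc in docs]
```

Because nothing is filtered out, docs[idx] still belongs to the example at idx, and the length check is enough to skip the placeholders later.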
The way you have currently implemented it has a small bug:
    docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]
    attribute_docs[attribute] = docs
Here docs will be the spaCy docs for a filtered set of texts, which means the order of the docs no longer matches the order of the training examples in training_data.training_examples. This would cause a problem here:
    for idx, example in enumerate(training_data.training_examples):
        example_attribute_doc = attribute_docs[attribute][idx]
        if len(example_attribute_doc):
            # If length is 0, that means the initial text feature was None and was replaced by ''
            # in preprocess method
            example.set(
                MESSAGE_SPACY_FEATURES_NAMES[attribute], example_attribute_doc
            )
attribute_docs[attribute][idx] does not correspond to the correct spaCy doc for the training example at index idx in training_data.training_examples.
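One option might be to remember the original index of each text before filtering and put the docs back into a list with one slot per training example. This is only a sketch under assumptions: fake_pipe again stands in for self.nlp.pipe, and docs_aligned_with_examples is a hypothetical helper, not something that exists in Rasa or spaCy.

```python
from typing import Callable, Iterable, List, Optional

Doc = List[str]  # placeholder type; with spaCy this would be a Doc object


def fake_pipe(texts: Iterable[str]) -> Iterable[Doc]:
    # Hypothetical stand-in for self.nlp.pipe: one "doc" per text, in order.
    for text in texts:
        yield text.split()


def docs_aligned_with_examples(
    texts: List[Optional[str]],
    pipe: Callable[[Iterable[str]], Iterable[Doc]] = fake_pipe,
) -> List[Optional[Doc]]:
    # Remember where each real text came from before filtering.
    kept = [(idx, text) for idx, text in enumerate(texts) if text is not None]
    # One slot per original example; missing attributes stay None.
    aligned: List[Optional[Doc]] = [None] * len(texts)
    for (idx, _), doc in zip(kept, pipe(text for _, text in kept)):
        aligned[idx] = doc
    return aligned


texts = ["hello there", None, "book a table"]

# Filtering alone shifts positions: index 1 now holds the doc for texts[2].
filtered_docs = list(fake_pipe(t for t in texts if t is not None))

# Scattering back by index keeps docs and examples in step.
aligned_docs = docs_aligned_with_examples(texts)
```

With something like this, the loop over training_data.training_examples would check for None instead of a zero-length doc before calling example.set.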
I haven’t looked at the spacy-pytorch-transformers library myself, but do you have any other ideas for avoiding this?