We have some problems classifying the entity type for recognized values, so we’d like to try different configurations of the extractor. I cannot find anything about the “features” configuration besides what is documented here:
While this gives some information, I still don’t know what the individual keys mean. Can someone point me in the right direction?
We will update the documentation soon to add the missing explanations. For now, please see the table below.
============  ====================================================================================================
Feature Name  Description
============  ====================================================================================================
low           Checks if the token is lower case.
upper         Checks if the token is upper case.
title         Checks if the token starts with an uppercase character and all remaining characters are lowercased.
digit         Checks if the token contains just digits.
prefix5       Take the first five characters of the token.
prefix2       Take the first two characters of the token.
suffix5       Take the last five characters of the token.
suffix3       Take the last three characters of the token.
suffix2       Take the last two characters of the token.
suffix1       Take the last character of the token.
pos           Take the Part-of-Speech tag of the token (SpaCy required).
pos2          Take the first two characters of the Part-of-Speech tag of the token (SpaCy required).
pattern       Take the patterns defined by ``RegexFeaturizer``.
bias          Add an additional "bias" feature to the list of features.
============  ====================================================================================================
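
To make the descriptions more concrete, here is a rough Python sketch of what each feature name would compute for a single token. This is only an illustration written from the table, not the extractor's actual code: the names ``FEATURE_FUNCTIONS`` and ``token_features`` are made up for this example, the part-of-speech tag is passed in by hand instead of coming from SpaCy, and ``pattern`` is left out because it depends on what ``RegexFeaturizer`` produces.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping from feature name to a function of (token, pos_tag).
# Written from the table's descriptions; NOT the library's implementation.
FEATURE_FUNCTIONS: Dict[str, Callable[[str, Optional[str]], object]] = {
    "low": lambda tok, pos: tok.islower(),
    "upper": lambda tok, pos: tok.isupper(),
    # First character uppercase, all remaining characters lowercased.
    "title": lambda tok, pos: tok[:1].isupper() and tok[1:] == tok[1:].lower(),
    "digit": lambda tok, pos: tok.isdigit(),
    "prefix5": lambda tok, pos: tok[:5],
    "prefix2": lambda tok, pos: tok[:2],
    "suffix5": lambda tok, pos: tok[-5:],
    "suffix3": lambda tok, pos: tok[-3:],
    "suffix2": lambda tok, pos: tok[-2:],
    "suffix1": lambda tok, pos: tok[-1:],
    # The POS tag is assumed to be supplied by the caller (e.g. from SpaCy).
    "pos": lambda tok, pos: pos,
    "pos2": lambda tok, pos: pos[:2] if pos is not None else None,
    # Constant feature that is present for every token.
    "bias": lambda tok, pos: "bias",
}


def token_features(
    token: str, names: List[str], pos_tag: Optional[str] = None
) -> Dict[str, object]:
    """Compute the requested features for one token."""
    return {name: FEATURE_FUNCTIONS[name](token, pos_tag) for name in names}


if __name__ == "__main__":
    print(token_features("Berlin", ["title", "prefix2", "suffix3"]))
    print(token_features("running", ["low", "pos", "pos2"], pos_tag="VBG"))
```

The prefix and suffix features return strings rather than booleans, so a token shorter than the requested length simply yields the whole token.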