I have a Japanese bot trained with a CRF named-entity (NE) model. It works well when evaluated on a train/test split of the data. However, when I tested a few cases in which new names replace names that exist in the training data, it generalized poorly. Examples:
[福原愛]に電話をかけてください。 // "Please call 福原愛." I replaced '福原愛' with another name, '三口百惠', and it didn't work.
[消防署]に電話。 // "Call the fire station." I replaced '消防署' with '消防', and it didn't work.
Does the ‘name’ have to occur at least once in the training data?
Is there any feature engineering that can improve this?
No, not every entity has to appear in the training data. How well the model generalizes depends on how strict your patterns are, whether the same keywords appear around the entity, and so on.
This might be really helpful
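To make the generalization point concrete, here is a minimal sketch of why context matters. It is not Rasa's actual CRF feature set. The tokenization is written by hand, and `token_features` and `is_kanji` are hypothetical helpers. The idea it shows: if the features for a token describe its surroundings and character type rather than its exact surface form, a seen name and an unseen name end up looking the same to the model.

```python
def is_kanji(ch):
    """Rough check: is the character in the main CJK Unified Ideographs block?"""
    return "\u4e00" <= ch <= "\u9fff"


def token_features(tokens, i, include_identity=False):
    """Hypothetical CRF-style features for the token at position i."""
    feats = {
        # Neighbouring tokens: identical for any name placed in this slot.
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
        # Coarse character-type feature instead of the exact string.
        "all_kanji": all(is_kanji(c) for c in tokens[i]),
    }
    if include_identity:
        # The exact surface form: this is what ties a model to seen names.
        feats["token"] = tokens[i]
    return feats


# Hand-tokenized sentences (assumption: a real tokenizer may split differently).
seen = ["福原愛", "に", "電話", "を", "かけて", "ください", "。"]
unseen = ["三口百惠", "に", "電話", "を", "かけて", "ください", "。"]

context_only_match = token_features(seen, 0) == token_features(unseen, 0)
identity_match = token_features(seen, 0, True) == token_features(unseen, 0, True)
print(context_only_match, identity_match)  # True False
```

If the model leans heavily on the exact token, a new name looks unfamiliar even though everything around it is unchanged, which would be consistent with the behaviour described in the question.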
@IgNoRaNt23, the blog says: "To use regular expressions and / or lookup tables add the intent_entity_featurizer_regex component before the ner_crf component in your pipeline."
What is “intent_entity_featurizer_regex”? Is it the same as ‘RegexFeaturizer’? Is it necessary when you use a lookup table?
Hey @twittmin. Yes, intent_entity_featurizer_regex was renamed to RegexFeaturizer. Yep, you should have this component if you are using lookup tables, because it’s one of the components used to extract the patterns.
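For intuition about what "extracting the patterns" from a lookup table can mean, here is a rough sketch in plain Python. It is not the RegexFeaturizer implementation. `build_lookup_pattern` and `pattern_flags` are hypothetical names, and the token spans are written by hand. The assumption is that the lookup entries are combined into one regular expression and each token gets a flag saying whether it overlaps a match, which the CRF can then use as a feature.

```python
import re


def build_lookup_pattern(entries):
    """Combine lookup-table entries into a single compiled regex."""
    # Longest entries first so that a longer name wins over a shorter prefix.
    ordered = sorted(set(entries), key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


def pattern_flags(text, token_spans, pattern):
    """Return one boolean per token: does it overlap any lookup match?"""
    matches = [m.span() for m in pattern.finditer(text)]
    return [
        any(start < m_end and m_start < end for m_start, m_end in matches)
        for start, end in token_spans
    ]


# Hypothetical lookup table of person names, including one not in the training data.
names = ["福原愛", "三口百惠"]
pattern = build_lookup_pattern(names)

text = "三口百惠に電話をかけてください。"
# Hand-written (start, end) character spans for each token (an assumption).
spans = [(0, 4), (4, 5), (5, 7), (7, 8), (8, 11), (11, 15), (15, 16)]

flags = pattern_flags(text, spans, pattern)
print(flags)  # [True, False, False, False, False, False, False]
```

With a flag like this, a name listed in the lookup table can be marked even if it never appears in a training sentence, as long as the training data contains enough examples where the flag and the entity label coincide.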