Which one is better for normalization, the spaCy tokenizer or the whitespace tokenizer? Or can we use them both in a pipeline?
You only want to use one tokenizer per pipeline. The whitespace tokenizer creates a new token every time it runs into whitespace. The spaCy tokenizer first splits on whitespace and then applies additional, language-specific rules, such as separating punctuation and handling special cases (https://spacy.io/usage/linguistic-features#tokenization). Which tokenizer you use will depend on the rest of your pipeline: different pipeline components rely on different tokenizers.
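To make the difference concrete, here is a rough sketch in plain Python. It is not spaCy's actual implementation: `rule_based_tokenize` is a hypothetical stand-in that only peels punctuation off the start and end of each whitespace-separated chunk, which is a simplified version of the kind of rule checking spaCy runs after splitting on whitespace. The function names and the example sentence are made up for illustration.

```python
import string

# Assumption: treat every ASCII punctuation character as splittable.
# spaCy's real rules are language-specific and far more detailed.
PUNCT = set(string.punctuation)


def whitespace_tokenize(text):
    """Create a new token at every run of whitespace."""
    return text.split()


def rule_based_tokenize(text):
    """Split on whitespace, then peel leading/trailing punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation from the front of the chunk.
        while chunk and chunk[0] in PUNCT:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation from the end, keeping original order.
        while chunk and chunk[-1] in PUNCT:
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(prefixes)
        if chunk:
            tokens.append(chunk)
        tokens.extend(suffixes)
    return tokens


sentence = "Hello, world! (This is a test.)"
print(whitespace_tokenize(sentence))
# ['Hello,', 'world!', '(This', 'is', 'a', 'test.)']
print(rule_based_tokenize(sentence))
# ['Hello', ',', 'world', '!', '(', 'This', 'is', 'a', 'test', '.', ')']
```

With the whitespace-only version, `world!` and `world` end up as different tokens, which is usually what hurts normalization downstream.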
@rctatman Assume I am creating training data from scratch in English. Which one would you suggest?
If you're planning on using spaCy at all (which I would, it's a great library), use their tokenizer.
So you are saying the spaCy tokenizer is better than the whitespace tokenizer? Okay, thanks, I will try it out.