Which one is better for normalization: the spaCy tokenizer or the whitespace tokenizer? Or can we use them both in a pipeline?

faiza_conte · October 28, 2020, 4:46pm

Which one is better for normalization, the spaCy tokenizer or the whitespace tokenizer? Or can we use them both in a pipeline?

rctatman · October 29, 2020, 8:38pm

You only want to use one tokenizer per pipeline. The whitespace tokenizer creates a new token every time it runs into whitespace. The spaCy tokenizer also splits on whitespace first, then applies additional language-specific rules to the resulting pieces (https://spacy.io/usage/linguistic-features#tokenization). Which tokenizer you use will depend on the rest of your pipeline: different pipeline components rely on different tokenizers.
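To make the difference concrete, here is a rough sketch in plain Python. The `rule_based_tokenize` function is a hypothetical, much simplified stand-in for the kind of rule checking described above (it only peels off trailing punctuation and splits a few English contractions); it is not spaCy's actual algorithm, and the function names are made up for illustration.

```python
import re


def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays attached to words.
    return text.split()


# Illustrative rules only (an assumption, NOT spaCy's real rule set):
# trailing punctuation and a handful of English contraction suffixes.
_TRAILING_PUNCT = re.compile(r"^(.*?)([.,!?;:]+)$")
_CONTRACTION = re.compile(r"^(.+?)(n't|'s|'re|'ll|'ve|'d|'m)$", re.IGNORECASE)


def rule_based_tokenize(text):
    tokens = []
    for chunk in text.split():  # step 1: whitespace splitting
        trailing = []
        match = _TRAILING_PUNCT.match(chunk)
        if match and match.group(1):  # step 2: peel off trailing punctuation
            chunk = match.group(1)
            trailing = list(match.group(2))
        contraction = _CONTRACTION.match(chunk)
        if contraction:  # step 3: split contractions such as "Don't"
            tokens.extend([contraction.group(1), contraction.group(2)])
        else:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


sentence = "Don't book the flight, please!"
whitespace_tokens = whitespace_tokenize(sentence)
rule_tokens = rule_based_tokenize(sentence)
print(whitespace_tokens)
print(rule_tokens)
```

With whitespace splitting alone, `flight,` and `please!` keep their punctuation and `Don't` stays a single token. The rule-based pass separates them, which is roughly the kind of difference downstream components may depend on.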

faiza_conte · October 29, 2020, 8:40pm

@rctatman Assume I am creating training data from scratch in English. Which one would you suggest?

rctatman · October 29, 2020, 8:52pm

If you’re planning on using spaCy at all (which I would, it’s a great library), use their tokenizer.

faiza_conte · October 29, 2020, 8:53pm

So you are saying that using the spaCy tokenizer is better than the whitespace tokenizer? Okay, thanks, I will try it out.
