spaCy, Gensim, fastText and BytePair for non-English languages

welly87 · August 26, 2020, 2:11am

Hi,

What is the core practical difference between using spaCy and using Gensim, fastText, or BytePair as a featurizer? I was told before by @koaning that this benchmark just contrib.

Thoughts?

Cheers

koaning · August 26, 2020, 10:32am

The differences are mainly algorithmic: how the embeddings were created and what datasets they were trained on. Whether or not they are useful to your pipeline is best confirmed via a benchmark. It’s hard to say upfront.

In short, I’ll list a summary of some of the main differences.

spaCy currently uses GloVe (algorithm whiteboard link) to the best of my knowledge (I could be wrong), but they have an internal trick where similar words point to the same embedding to save memory. The models are trained on news data, but they’re available in only a few languages.
fastText (algorithm whiteboard link 1 and algorithm whiteboard link 2) is super heavy, but it also encodes n-gram embeddings, which makes it more robust against out-of-vocabulary words. The vectors are available in 157 languages but can be 7GB in size.
BytePair embeddings (algorithm whiteboard link) do something similar to fastText, but they are pickier about which n-grams to actually keep. This makes them much lighter. I believe these are trained on Wikipedia, and they are available in 275 languages. You can also customise the dimensions/vocab size a bit more if you’d like.
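
To make the n-gram point a bit more concrete, here is a minimal sketch in plain Python of why subword n-grams help with out-of-vocabulary words. It is not the fastText or BytePair implementation. The function names and the example words are made up for illustration, and the 3 to 6 character range is an assumption that mirrors the commonly cited fastText defaults. The real libraries go further and map the n-grams to learned vectors, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    get their own n-grams (illustrative, fastText-style convention).
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical example words: a known word, a typo of it, and an unrelated word.
known = "booking"
misspelled = "bookking"
unrelated = "weather"

print(ngram_overlap(known, misspelled))  # shares many n-grams with the known word
print(ngram_overlap(known, unrelated))   # shares none
```

A word-level lookup would treat the misspelled word as completely unknown, while a subword model can still build a representation for it from the pieces it shares with words it has seen. BytePair embeddings rely on the same idea but keep a smaller, learned set of subword pieces instead of every n-gram.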

welly87 · August 26, 2020, 10:45am

Thanks a lot,

I think I will stick with BytePair for now.

Cheers
