spaCy, Gensim, fastText and BytePair for non-English languages


What’s the core practical difference between using spaCy and using Gensim, fastText, or BytePair as a featurizer? I was told before by @koaning that this benchmark just contrib.



The differences are mainly algorithmic: how the embeddings were created and what datasets they were trained on. Whether or not they are useful to your pipeline is best confirmed via a benchmark. It’s hard to say upfront.

Here’s a short summary of some of the main differences.

  1. spaCy currently uses GloVe (algorithm whiteboard link) to the best of my knowledge (I could be wrong), with an internal trick where similar words point to the same embedding to save on memory. The models are trained on news data, but they’re only available in a few languages.
  2. fastText (algorithm whiteboard link 1 and algorithm whiteboard link 2) is super heavy, but it also encodes n-gram embeddings, which makes it more robust against out-of-vocabulary words. The embeddings are available in 157 languages but can be 7GB in size.
  3. BytePair embeddings (algorithm whiteboard link) do something similar to fastText, but they are pickier about which n-grams to actually keep. This makes them much lighter. I believe these are trained on Wikipedia, and they are available in 275 languages. You can also customise the dimensions/vocab size a bit more if you’d like.
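
To make the out-of-vocabulary point a bit more concrete, here’s a toy sketch of character n-gram extraction in the style fastText uses. It is not the library’s actual code: the function names `char_ngrams` and `shared_ngrams` are made up for illustration, and the 3 to 6 length range is just an assumed default. The idea is that an unseen word still shares many n-grams with words seen during training, so a vector can be assembled for it from those pieces.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with boundary markers.

    Hypothetical helper for illustration; min_n/max_n are assumed defaults.
    """
    marked = f"<{word}>"  # "<" and ">" distinguish prefixes from suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def shared_ngrams(known_word, unseen_word, n=3):
    """List the n-grams of an unseen word that also occur in a known word."""
    known = set(char_ngrams(known_word, n, n))
    return [g for g in char_ngrams(unseen_word, n, n) if g in known]


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

print(shared_ngrams("booking", "rebooking"))
# ['boo', 'ook', 'oki', 'kin', 'ing', 'ng>']
```

BytePair embeddings follow the same subword idea, but instead of keeping every n-gram they keep only a fixed-size vocabulary of frequently occurring pieces, which is where the size savings come from.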

Thanks a lot,

I think I will stick with BytePair for now.