Ukrainian spaCy language model

Dmitriy · July 19, 2021, 11:26am

Hello everyone. Can anybody tell us whether there is a Ukrainian language model for spaCy? When will it be available for download? Thanks.

souvikg10 · July 19, 2021, 12:25pm

Hello, I don’t think a Ukrainian spaCy language model is available.

You could try fastText, though.

github.com

facebookresearch/fastText/blob/master/docs/crawl-vectors.md#models

---
id: crawl-vectors
title: Word vectors for 157 languages
---

We distribute pre-trained word vectors for 157 languages, trained on [*Common Crawl*](http://commoncrawl.org/) and [*Wikipedia*](https://www.wikipedia.org) using fastText.
These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
We also distribute three new word analogy datasets, for French, Hindi and Polish.

### Download directly with command line or from python

In order to download with command line or from python code, you must have installed the python package as [described here](/docs/en/support.html#building-fasttext-python-module).

<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
```bash
$ ./download_model.py en     # English
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
 (19.78%) [=========>                                         ]
```


There is a Ukrainian model trained on the crawl data.

See here for how to use it with Rasa by installing rasa-nlu-examples as a dependency: https://rasahq.github.io/rasa-nlu-examples/docs/featurizer/fasttext/

Dmitriy · July 22, 2021, 3:01pm

Thanks for the answers. But the model is very big, about 7 GB. Are there any smaller models for testing? Where can I download them?

Dmitriy · July 22, 2021, 3:02pm

Rasa crashes during training. I think it is because there is not enough RAM.

souvikg10 · July 22, 2021, 4:18pm

Most likely. Technically, you could use the DIET classifier on its own. You don’t really need a pre-trained model for it.
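
On the memory point, here is a rough back-of-the-envelope sketch of where the ~7 GB comes from and what a lower dimension might save. The vocabulary and bucket counts are my assumptions about the `cc.*.300.bin` files, not figures from this thread. `shrink_fasttext` and the file names are hypothetical; the helper relies on `fasttext.util.reduce_model`, which the fastText docs describe for adapting the dimension.

```python
def model_size_gb(vocab: int, buckets: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate size in GB (1e9 bytes) of a fastText .bin model's weights.

    Assumes an input matrix of (vocab + buckets) rows and an output
    matrix of vocab rows, both stored as float32.
    """
    input_values = (vocab + buckets) * dim   # word vectors + hashed n-gram vectors
    output_values = vocab * dim              # output layer kept from training
    return (input_values + output_values) * bytes_per_value / 1e9


# Assumed figures for the cc.*.300.bin models; check them against your own file.
VOCAB = 2_000_000
BUCKETS = 2_000_000

size_300 = model_size_gb(VOCAB, BUCKETS, dim=300)
size_100 = model_size_gb(VOCAB, BUCKETS, dim=100)
print(f"dim=300: ~{size_300:.1f} GB, dim=100: ~{size_100:.1f} GB")


def shrink_fasttext(src: str, dst: str, target_dim: int = 100) -> None:
    """Hypothetical helper: load a .bin model, reduce its dimension, save it."""
    import fasttext        # imported lazily so the estimate above runs without it
    import fasttext.util

    model = fasttext.load_model(src)            # loads the full model into RAM once
    fasttext.util.reduce_model(model, target_dim)
    model.save_model(dst)


# Example call (not executed here; file names are placeholders):
# shrink_fasttext("cc.uk.300.bin", "cc.uk.100.bin", target_dim=100)
```

With those assumed counts the estimate comes to roughly 7.2 GB at 300 dimensions and 2.4 GB at 100. The reduction step still has to load the full model once, so it probably needs to run on a machine with more memory than the one used for training.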

You can potentially prune the fastText vectors by importing them into spaCy, which should reduce the size. I did that about three years ago, though, so I am not sure how it works with spaCy 3.0.
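
To illustrate the pruning idea without depending on a particular spaCy version, here is a minimal sketch that keeps only the first N rows of a fastText text-format `.vec` file. It assumes the usual layout (a `count dim` header, then one word per line, most frequent first). `truncate_vec` and the file names are made up for this example. As far as I know, spaCy v3 offers something similar through `spacy init vectors` with its `--truncate` and `--prune` options, but I have not checked that against the current release.

```python
import io
from typing import Iterable, Iterator


def truncate_vec(lines: Iterable[str], keep: int) -> Iterator[str]:
    """Yield a .vec file limited to its first `keep` word rows.

    The header line is rewritten so the word count matches the output.
    """
    it = iter(lines)
    total, dim = (int(x) for x in next(it).split()[:2])
    keep = min(keep, total)
    yield f"{keep} {dim}\n"
    for i, line in enumerate(it):
        if i >= keep:
            break
        yield line if line.endswith("\n") else line + "\n"


# Tiny in-memory stand-in for a real .vec file (values are invented).
sample = io.StringIO("3 2\nі 0.1 0.2\nна 0.3 0.4\nмова 0.5 0.6\n")
small = list(truncate_vec(sample, keep=2))
print("".join(small), end="")

# On a real file (placeholder names, not executed here):
# import gzip
# with gzip.open("cc.uk.300.vec.gz", "rt", encoding="utf-8") as src, \
#         open("cc.uk.300.top50k.vec", "w", encoding="utf-8") as dst:
#     dst.writelines(truncate_vec(src, keep=50_000))
```

Keep in mind that a truncated `.vec` file has no subword information, so out-of-vocabulary words get no vector. The featurizer config in this thread loads a `.bin` file, so a truncated `.vec` would likely have to go through spaCy or another loader instead.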

Dmitriy · July 23, 2021, 12:43pm

We added a fastText model for Russian and compared the quality of intent detection. Config:

language: ru

pipeline:
  # - name: SpacyNLP
  #   model: "ru_core_news_lg"
  #   case_sensitive: False
  # the tokenizer has to come before featurizers that read its tokens
  - name: "WhitespaceTokenizer"
  - name: rasa_nlu_examples.featurizers.dense.FastTextFeaturizer
    cache_dir: downloaded
    file: cc.ru.300.bin
  - name: "LexicalSyntacticFeaturizer"
  - name: rasa_nlu_examples.meta.Printer
    alias: printer after
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "DIETClassifier"
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.5
  - name: "EntitySynonymMapper"
  - name: ResponseSelector
    epochs: 100

Dmitriy · July 23, 2021, 12:45pm

The result of comparing spaCy and fastText was not good: spaCy works better than fastText. With fastText, many messages end up as the fallback intent. What can we do to get better intent detection with fastText? Thanks.

Dmitriy · July 23, 2021, 12:46pm

souvikg10 · July 23, 2021, 3:15pm

fastText was trained on Common Crawl and Wikipedia, while the Russian spaCy model is trained on Nerus, a large silver-standard Russian corpus annotated with POS tags, syntax trees and NER tags (PER, LOC, ORG). See the natasha/nerus repository on GitHub.

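As they mention, the standard is silver, meaning the annotations were produced automatically rather than by hand. Even so, the spaCy model is trained on annotated data, whereas the fastText vectors are learned from raw text.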

I suppose you can add new tokens to create a model using fastText, but obviously you would have to provide labelled data. It would be the same as training spaCy.

I have a very old codebase where I imported fastText models into spaCy. From there you can add further data and train a spaCy model that improves on the fastText vectors.

worth trying…
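
Since the config above uses a fallback threshold of 0.5, it may also be worth checking how much of the fallback rate comes from the threshold itself. Below is a simplified sketch of the thresholding idea, not Rasa's actual implementation (it ignores the ambiguity check, for instance). The confidence values are made up, and `apply_threshold` and `count_fallbacks` are hypothetical helpers.

```python
from typing import List, Mapping

FALLBACK_INTENT = "nlu_fallback"


def apply_threshold(confidences: Mapping[str, float], threshold: float) -> str:
    """Return the top intent, or the fallback intent if it scores below threshold."""
    top_intent = max(confidences, key=confidences.get)
    if confidences[top_intent] < threshold:
        return FALLBACK_INTENT
    return top_intent


def count_fallbacks(predictions: List[Mapping[str, float]], threshold: float) -> int:
    """Count how many predictions would be replaced by the fallback intent."""
    return sum(apply_threshold(p, threshold) == FALLBACK_INTENT for p in predictions)


# Made-up confidences for four test messages (illustrative only).
predictions = [
    {"greet": 0.82, "goodbye": 0.10},
    {"greet": 0.46, "order": 0.41},
    {"order": 0.55, "goodbye": 0.30},
    {"goodbye": 0.35, "greet": 0.33},
]

for threshold in (0.5, 0.4, 0.3):
    print(threshold, count_fallbacks(predictions, threshold))
```

If most fallbacks sit just under the threshold, the classifier may simply be less confident with the new features. Lowering the threshold would hide that rather than fix it, so comparing the raw confidences between the spaCy and fastText pipelines is probably more informative.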
