Hello everyone. Can anybody tell us whether there is a Ukrainian language model for spaCy? When will it be available for download? Thanks.
Hello, I don't think a Ukrainian language model for spaCy is available.
You could try fastText though;
there is a Ukrainian model trained on crawl data.
Check here for how to use it with Rasa by installing rasa-nlu-examples as a dependency: https://rasahq.github.io/rasa-nlu-examples/docs/featurizer/fasttext/
Thanks for the answers. But the model is very big, 7 GB. Are there any smaller models for testing? Where can I download one?
Rasa crashes during training; I think it is because there is not enough RAM.
Most likely. Technically you could use the DIET classifier on its own. You don't really need a pretrained model for it.
You can potentially prune the fastText vectors by importing them into spaCy, which should reduce the size. I did that about 3 years ago, so I am not sure how it works with spaCy 3.0.
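To put rough numbers on the size problem, here is a minimal Python sketch. It assumes (this is not stated in the thread) that the crawl model has about 2,000,000 vocabulary words, 2,000,000 subword hash buckets and 4-byte floats, which would give roughly 7.2 GB at 300 dimensions. The second function uses `fasttext.util.reduce_model` from the fastText Python package to shrink the vectors; the file paths are placeholders, and the full model still has to fit in RAM once on whatever machine does the reduction.

```python
# Rough size estimate for a fastText .bin model, plus a sketch of shrinking it.
# Assumed (not from the thread): 2,000,000 vocabulary words and 2,000,000
# subword hash buckets, stored as 4-byte floats.

BYTES_PER_FLOAT = 4


def estimate_model_gb(dim, n_words=2_000_000, n_buckets=2_000_000):
    """Approximate size in GB (10**9 bytes) of the two weight matrices."""
    input_rows = n_words + n_buckets  # word vectors + subword n-gram buckets
    output_rows = n_words             # output layer kept from training
    total_floats = (input_rows + output_rows) * dim
    return total_floats * BYTES_PER_FLOAT / 1e9


def reduce_fasttext_model(src_path, dst_path, target_dim=100):
    """Shrink a fastText model to target_dim dimensions and save it.

    Needs the `fasttext` package and enough RAM to load the full model once.
    The paths are placeholders chosen by the caller.
    """
    import fasttext
    import fasttext.util

    model = fasttext.load_model(src_path)
    fasttext.util.reduce_model(model, target_dim)  # reduces dimensions in place
    model.save_model(dst_path)
    return model.get_dimension()


if __name__ == "__main__":
    for dim in (300, 100):
        print(f"dim={dim}: about {estimate_model_gb(dim):.1f} GB")
```

Under these assumptions a 100-dimensional model would need about a third of the memory. The reduced .bin file could then be referenced from the featurizer config in place of the full one, at some cost in vector quality.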
We added a fastText model for Russian and compared the quality of intent detection. Config:
language: ru
pipeline:
# - name: SpacyNLP
#   model: "ru_core_news_lg"
#   case_sensitive: False
- name: rasa_nlu_examples.featurizers.dense.FastTextFeaturizer
  cache_dir: downloaded
  file: cc.ru.300.bin
- name: "WhitespaceTokenizer"
- name: "LexicalSyntacticFeaturizer"
- name: rasa_nlu_examples.meta.Printer
  alias: printer after
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: "DIETClassifier"
  epochs: 100
- name: FallbackClassifier
  threshold: 0.5
- name: "EntitySynonymMapper"
- name: ResponseSelector
  epochs: 100
The result of the comparison was not good for fastText: spaCy works better. With fastText, many messages end up as fallback intents. What can we do to improve the quality of intent detection with fastText? Thanks.
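For readers wondering why weaker features show up as "many fallback intents": the sketch below is a simplified illustration of the decision rule that Rasa documents for FallbackClassifier, not Rasa's actual code. The function name `pick_intent` is hypothetical, and the `ambiguity_threshold` of 0.1 is assumed to be the default, since the config above only sets `threshold: 0.5`.

```python
def pick_intent(ranking, threshold=0.5, ambiguity_threshold=0.1):
    """Return the winning intent, or "nlu_fallback" if confidence is too low.

    ranking: list of (intent_name, confidence) pairs in any order.
    Simplified illustration only; the names here are hypothetical.
    """
    ranked = sorted(ranking, key=lambda pair: pair[1], reverse=True)
    top_intent, top_conf = ranked[0]

    # Rule 1: the best intent is not confident enough.
    if top_conf < threshold:
        return "nlu_fallback"

    # Rule 2: the two best intents are too close to call.
    if len(ranked) > 1 and top_conf - ranked[1][1] < ambiguity_threshold:
        return "nlu_fallback"

    return top_intent


if __name__ == "__main__":
    print(pick_intent([("greet", 0.8), ("goodbye", 0.1)]))
    print(pick_intent([("greet", 0.45), ("goodbye", 0.3)]))
```

So if the classifier's confidences are generally lower or less separated with the fastText features, more messages fall under the threshold. Lowering `threshold` would reduce fallbacks, but at the price of accepting more uncertain predictions.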
fastText was trained on Common Crawl and Wikipedia, while the spaCy Russian model is trained on Nerus, a large silver-standard Russian corpus annotated with POS tags, syntax trees and NER tags (PER, LOC, ORG). See the natasha/nerus repository on GitHub.
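As they mention, the corpus is silver standard, meaning the annotations were produced automatically rather than by hand, but it still gives the spaCy model annotated Russian text to learn from, which the raw fastText training data lacks.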
I suppose you could add new tokens and build a model on top of fastText, but obviously you would have to provide labelled data. It would be the same as training spaCy.
I have a very old codebase where I imported fastText models into spaCy; from there you can add further data and train a spaCy model that improves upon the fastText vectors.
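That old codebase is not shown here, so the following is only a rough sketch of what such an import could look like, not the poster's code. It assumes the vectors are available in fastText's text (.vec) format, with a "count dim" header line followed by one word per line, and it uses spaCy's `Vocab.set_vector` and `Vocab.prune_vectors`. The function names, paths and the `keep` parameter are hypothetical.

```python
import io


def read_vec(handle, limit=None):
    """Parse fastText's text (.vec) format from an open text handle.

    Expects a 'count dim' header line, then 'word v1 v2 ... vN' per line.
    Returns (dim, {word: [floats]}); stops after `limit` words if given.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def build_spacy_vectors(vec_path, out_dir, lang="ru", keep=None):
    """Load .vec vectors into a blank spaCy pipeline and save it.

    Hypothetical helper: needs `spacy` and `numpy` installed.
    keep: if set, prune the vector table down to this many rows.
    """
    import numpy
    import spacy

    nlp = spacy.blank(lang)
    with open(vec_path, encoding="utf-8") as handle:
        _, vectors = read_vec(handle)
    for word, values in vectors.items():
        nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))
    if keep is not None:
        nlp.vocab.prune_vectors(keep)  # remaps dropped words to nearest kept row
    nlp.to_disk(out_dir)
    return nlp


# Tiny in-memory example of the assumed file format.
sample = io.StringIO("2 3\nпривет 0.1 0.2 0.3 \nмир 1.0 2.0 3.0\n")
dim, vectors = read_vec(sample)
print(dim, list(vectors))
```

The saved directory could then be loaded with `spacy.load` and used as the starting point for further training. For large vector files, spaCy's own `init vectors` command is probably the more practical route than setting vectors one at a time.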