Hi Pauli,
I’ve never used Finnish in ML before but I’ll try to offer some help.
If you end up training your own custom spaCy model then you may find this guide useful. It explains how to get a custom spaCy model into Rasa. I’m mentioning this because there are other packages out there that integrate with spaCy. For example, there are some BERT-style models that you could have a look at (some of them, if I recall correctly, are multilingual).
This is just some background. I think for you right now there are two paths to consider.
Path 1: spaCy
There’s actually some light support for fasttext inside of spaCy! You can find a guide here. Technically you should then be able to load that spaCy model into Rasa. I’ve never done this myself but it should be possible.
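If you go that route, the Rasa side would just point the spaCy components at your custom model. As a rough sketch, a Rasa 1.10 config could look something like the snippet below; fi_custom_fasttext is only a placeholder for whatever spaCy package or link you end up creating from the fasttext vectors.

language: fi
pipeline:
  - name: SpacyNLP                   # loads your custom spaCy model with fasttext vectors
    model: "fi_custom_fasttext"      # placeholder: the name/path of your packaged spaCy model
  - name: SpacyTokenizer
  - name: SpacyFeaturizer            # passes the model's word vectors on as dense features
  - name: DIETClassifier
    epochs: 200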
Path 2: Custom Components
Fasttext embeddings are not natively supported inside of Rasa, but I’ve written my own fasttext component for some of my personal research. I’m planning to open source it at some point with proper tests, but for the time being I’ll share the core code below. It assumes you’ve saved the code in the root directory of your Rasa project in a file called fastfeatures.py.
import typing
from typing import Any, Optional, Text, Dict, List, Type

import os
import fasttext
import numpy as np

from rasa.nlu.components import Component
from rasa.nlu.featurizers.featurizer import DenseFeaturizer
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.nlu.constants import DENSE_FEATURE_NAMES, DENSE_FEATURIZABLE_ATTRIBUTES, TEXT

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class FastTextFeaturizer(DenseFeaturizer):
    """This component adds fasttext features."""

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return [Tokenizer]

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["fasttext"]

    defaults = {"file": None, "cache_dir": None}
    language_list = None

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        # Load the pre-trained fasttext model from the configured folder.
        path = os.path.join(component_config["cache_dir"], component_config["file"])
        self.model = fasttext.load_model(path)

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        # Attach fasttext features to every training example.
        for example in training_data.intent_examples:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self.set_fasttext_features(example, attribute)

    def set_fasttext_features(self, message: Message, attribute: Text = TEXT):
        # One vector per token, plus a sentence-level vector for the __CLS__ token.
        text_vector = self.model.get_word_vector(message.text)
        word_vectors = [
            self.model.get_word_vector(t.text)
            for t in message.data["tokens"]
            if t.text != "__CLS__"
        ]
        X = np.array(word_vectors + [text_vector])  # remember, we need one for __CLS__
        features = self._combine_with_existing_dense_features(
            message, additional_features=X, feature_name=DENSE_FEATURE_NAMES[attribute]
        )
        message.set(DENSE_FEATURE_NAMES[attribute], features)

    def process(self, message: Message, **kwargs: Any) -> None:
        # Featurize incoming messages at inference time.
        self.set_fasttext_features(message)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        else:
            return cls(meta)
Then you should be able to add it to your pipeline like so:
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: fastfeatures.FastTextFeaturizer
    cache_dir: "<path>/<to>/<folder>"
    file: "cc.en.300.bin.gz"
  - name: DIETClassifier
    epochs: 200
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
In your case you’d need to replace cc.en.300.bin.gz with the corresponding Finnish fastText file (and likely set language: fi at the top of the config).
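For example, the featurizer entry might then look something like this. The exact filename depends on what you download from the fastText site, and note that fasttext.load_model generally expects the unpacked .bin, so you may need to gunzip the download first.

  - name: fastfeatures.FastTextFeaturizer
    cache_dir: "<path>/<to>/<folder>"
    file: "cc.fi.300.bin"   # unpacked Finnish vectors; adjust to your actual filename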
A few things to mention: this code runs locally on my machine using Rasa 1.10.0, but as we move closer to Rasa 2.0 some internals might change, and I just want to make sure you’re aware of that. The goal is to host a custom component like this on GitHub in the future, but I want to make sure that the tools we host are 100% Rasa 2.0 compatible.
I hope this helps.