Rasa with a foreign language: Finnish

I am trying to implement a Rasa chatbot for the Finnish language. From here

I understood that I should use SpacyNLP, but when I look here

it seems that Finnish doesn’t have a ready-made model, only language data?

Somewhere it was also mentioned that you could use fasttext vectors.

They have Finnish there, but I don’t know how to include it in my Rasa chatbot?

Also, what would the correct config.yml, i.e. pipeline settings, be then?

Hi Pauli,

I’ve never used Finnish in ML before but I’ll try to offer some help.

If you end up training your own custom spaCy model then you may find this guide useful. It explains how to get a custom spaCy model into Rasa. I’m mentioning this because there are other packages out there that integrate with spaCy. For example, there are some BERT-style models that you could have a look at (some of them, if I recall correctly, are multilingual).

This is just some background. I think for you right now there are two paths to consider.

Path 1: spaCy

There’s actually some light support for fasttext inside of spaCy! You can find a guide here. Technically you should be able to load such a spaCy model into Rasa. I’ve never done this before but it should be possible.
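I haven’t tried this for Finnish myself, so treat the following as a rough sketch rather than a recipe. With spaCy 2.x (which Rasa 1.10 uses) you can build a model from the fasttext .vec file with something like python -m spacy init-model fi ./fi_ft_model --vectors-loc cc.fi.300.vec.gz and then make it loadable by name via python -m spacy link ./fi_ft_model fi_ft (fi_ft is just a placeholder name I made up here). The Rasa config could then use the standard spaCy components and point SpacyNLP at that name:

language: fi

pipeline:
- name: SpacyNLP
  model: "fi_ft"
- name: SpacyTokenizer
- name: SpacyFeaturizer
- name: DIETClassifier
  epochs: 200

DIETClassifier would then train on the dense features that SpacyFeaturizer produces from the fasttext vectors.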

Path 2: Custom Components

Fasttext embeddings are not natively supported inside of Rasa, but I’ve written my own fasttext component for some of my own personal research. I am planning on open-sourcing it at some point with proper tests, but for the time being I’ll share the core code below. It assumes you’ve got this code in the root directory of your Rasa project in a file called fastfeatures.py.

import typing
from typing import Any, Optional, Text, Dict, List, Type
import fasttext
import numpy as np
import os

from rasa.nlu.components import Component
from rasa.nlu.featurizers.featurizer import DenseFeaturizer
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.nlu.constants import DENSE_FEATURE_NAMES, DENSE_FEATURIZABLE_ATTRIBUTES, TEXT

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class FastTextFeaturizer(DenseFeaturizer):
    """This component adds fasttext features."""

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return [Tokenizer]

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["fasttext"]

    defaults = {"file": None, "cache_dir": None}
    language_list = None

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        # Load the pre-trained fasttext binary from the configured cache_dir/file.
        path = os.path.join(component_config["cache_dir"], component_config["file"])
        self.model = fasttext.load_model(path)

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        for example in training_data.intent_examples:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self.set_fasttext_features(example, attribute)

    def set_fasttext_features(self, message: Message, attribute: Text = TEXT):
        text_vector = self.model.get_word_vector(message.text)
        word_vectors = [
            self.model.get_word_vector(t.text)
            for t in message.data["tokens"]
            if t.text != "__CLS__"
        ]
        X = np.array(word_vectors + [text_vector])  # remember, we need one for __CLS__

        features = self._combine_with_existing_dense_features(
            message, additional_features=X, feature_name=DENSE_FEATURE_NAMES[attribute]
        )
        message.set(DENSE_FEATURE_NAMES[attribute], features)

    def process(self, message: Message, **kwargs: Any) -> None:
        self.set_fasttext_features(message)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        # Nothing to persist; the fasttext binary is reloaded from cache_dir on load.
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""

        if cached_component:
            return cached_component
        else:
            return cls(meta)

Then you should be able to add it to your pipeline like so:

language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: fastfeatures.FastTextFeaturizer
  cache_dir: "<path>/<to>/<folder>"
  file: "cc.en.300.bin.gz"
- name: DIETClassifier
  epochs: 200

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

In your case you’d need to replace cc.en.300.bin.gz with the corresponding Finnish file (and set language: fi at the top of the config).
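As an aside, here’s a minimal sketch of one way to grab the Finnish vectors, assuming you have the fasttext Python package installed and a version that ships the fasttext.util download helper:

import fasttext.util

# Downloads cc.fi.300.bin.gz from the fasttext website and unpacks it to
# cc.fi.300.bin in the current working directory; the download is skipped
# if the file is already there.
fasttext.util.download_model("fi", if_exists="ignore")

You’d then point cache_dir at that folder and file at the Finnish model file it produced.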

A few things to mention: this code runs locally on my machine using Rasa 1.10.0. As we move closer to Rasa 2.0 some internals might change, and I just want to make sure you’re aware of that. The goal is to host a custom component like this on GitHub in the future, but I want to make sure that the tools we host are 100% Rasa 2.0 compatible.

I hope this helps.


Hi Pauli & Vincent! As a Finn and native Finnish speaker I just wanted to post an update on this topic, since we now have a Finnish-language spaCy model here

And since the existing documentation and examples of using spaCy components are pretty old (from 3-5/2020), I’d propose a “how to” update?
