Hi Pauli,
I’ve never used Finnish in ML before but I’ll try to offer some help.
If you end up training your own custom spaCy model then you may find this guide useful. It explains how to get a custom spaCy model into Rasa. I’m mentioning this because there are other packages out there that integrate with spaCy. For example, there are some BERT-style models that you could have a look at (some of them, if I recall correctly, are multilingual).
This is just some background. I think for you right now there are two paths to consider.
Path 1: spaCy
There’s actually some light support for fasttext inside of spaCy! You can find a guide here. Technically you should then be able to load that spaCy model into Rasa. I’ve never done this myself but it should be possible.
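If you go that route, the Rasa side would just point the spaCy components at your custom model. As a rough sketch, a Rasa 1.10 config could look something like the snippet below; fi_custom_fasttext is only a placeholder for whatever spaCy package or link you end up creating from the fasttext vectors.

language: fi
pipeline:
  - name: SpacyNLP                   # loads your custom spaCy model with fasttext vectors
    model: "fi_custom_fasttext"      # placeholder: the name/path of your packaged spaCy model
  - name: SpacyTokenizer
  - name: SpacyFeaturizer            # passes the model's word vectors on as dense features
  - name: DIETClassifier
    epochs: 200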
Path 2: Custom Components
Fasttext embeddings are not natively supported inside of Rasa, but I’ve written my own fasttext component for some of my personal research. I’m planning to open source it at some point with proper tests, but for the time being I’ll share the core code below. It assumes you’ve saved the code in the root directory of your Rasa project in a file called fastfeatures.py.
import typing
from typing import Any, Optional, Text, Dict, List, Type

import os
import fasttext
import numpy as np

from rasa.nlu.components import Component
from rasa.nlu.featurizers.featurizer import DenseFeaturizer
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.nlu.constants import DENSE_FEATURE_NAMES, DENSE_FEATURIZABLE_ATTRIBUTES, TEXT

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class FastTextFeaturizer(DenseFeaturizer):
    """This component adds fasttext features."""

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return [Tokenizer]

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["fasttext"]

    defaults = {"file": None, "cache_dir": None}
    language_list = None

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        # Load the pre-trained fasttext model from the configured folder.
        path = os.path.join(component_config["cache_dir"], component_config["file"])
        self.model = fasttext.load_model(path)

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        # Attach fasttext features to every training example.
        for example in training_data.intent_examples:
            for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
                self.set_fasttext_features(example, attribute)

    def set_fasttext_features(self, message: Message, attribute: Text = TEXT):
        # One vector per token, plus a sentence-level vector for the __CLS__ token.
        text_vector = self.model.get_word_vector(message.text)
        word_vectors = [
            self.model.get_word_vector(t.text)
            for t in message.data["tokens"]
            if t.text != "__CLS__"
        ]
        X = np.array(word_vectors + [text_vector])  # remember, we need one for __CLS__
        features = self._combine_with_existing_dense_features(
            message, additional_features=X, feature_name=DENSE_FEATURE_NAMES[attribute]
        )
        message.set(DENSE_FEATURE_NAMES[attribute], features)

    def process(self, message: Message, **kwargs: Any) -> None:
        # Featurize incoming messages at inference time.
        self.set_fasttext_features(message)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        else:
            return cls(meta)
Then you should be able to add it to your pipeline like so:
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: fastfeatures.FastTextFeaturizer
    cache_dir: "<path>/<to>/<folder>"
    file: "cc.en.300.bin.gz"
  - name: DIETClassifier
    epochs: 200
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
In your case you’d need to replace cc.en.300.bin.gz with the corresponding Finnish fastText file (and likely set language: fi at the top of the config).
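For example, the featurizer entry might then look something like this. The exact filename depends on what you download from the fastText site, and note that fasttext.load_model generally expects the unpacked .bin, so you may need to gunzip the download first.

  - name: fastfeatures.FastTextFeaturizer
    cache_dir: "<path>/<to>/<folder>"
    file: "cc.fi.300.bin"   # unpacked Finnish vectors; adjust to your actual filename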
A few things to mention: this code runs locally on my machine using Rasa 1.10.0, but as we move closer to Rasa 2.0 some internals might change, and I just want to make sure you’re aware of that. The goal is to host a custom component like this on GitHub in the future, but I want to make sure that the tools we host are 100% Rasa 2.0 compatible.
I hope this helps.