I need Albert in LanguageModelFeaturizer

The Bert model is too heavy for my project, so I have to use Albert instead. Do you have any suggestions for me? Or how can I create a custom component for this: should I extend LanguageModelFeaturizer or HFTransformersNLP?

Hi! If none of the models listed below work for you, you should be able to extend LanguageModelFeaturizer (do not use HFTransformersNLP, as this component has been deprecated) to use a different one. You need to change the following values:

LMFeaturizer.model_weights #(string)
LMFeaturizer.model #(hugging face class, for example AlbertModel)
LMFeaturizer.tokenizer #(hugging face class, for example AlbertTokenizer)
LMFeaturizer.max_model_sequence_length #(int)

You should figure out the correct values for each, and set these accordingly in the constructor (max_model_sequence_length) or the method _load_model_instance (for all other values).

bert
gpt
gpt2
xlnet
distilbert
roberta

If you get stuck, feel free to ask a follow-up!

Is this OK to use in the pipeline instead of LanguageModelFeaturizer?

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import TFAlbertModel
from transformers import AlbertTokenizer

class AlbertFeaturizer(LanguageModelFeaturizer):
    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        self.model_weights = 'albert-base-v2'
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)
        self.max_model_sequence_length = 512
        self.pad_token_id = self.tokenizer.unk_token_id

I think that won't work for you, because _load_model_instance is called after _load_model_metadata and will overwrite the tokenizer and model you set in _load_model_metadata. You should override _load_model_instance and set self.tokenizer and self.model there instead.

To be clear, you can keep the first parts of _load_model_metadata (where you set model_name and model_weights), but you also need to override _load_model_instance and set tokenizer and model there.
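
As a rough illustration of why the call order matters, here is a minimal sketch in plain Python. SketchBase is a hypothetical stand-in, not the real LanguageModelFeaturizer, and the strings stand in for the actual tokenizer and model objects; the only thing it assumes is the order described above (metadata hook first, instance hook second).

```python
class SketchBase:
    """Hypothetical stand-in that only mimics the assumed loading order."""

    def __init__(self) -> None:
        # Assumed order: metadata first, then the model instance.
        self._load_model_metadata()
        self._load_model_instance()

    def _load_model_metadata(self) -> None:
        self.model_weights = "base-weights"

    def _load_model_instance(self) -> None:
        # The base class assigns its own tokenizer and model here.
        self.tokenizer = "base tokenizer"
        self.model = "base model"


class SetInMetadata(SketchBase):
    """Mirrors the first attempt: everything is set in the metadata hook."""

    def _load_model_metadata(self) -> None:
        self.model_weights = "albert-base-v2"
        self.tokenizer = "albert tokenizer"
        self.model = "albert model"


class SetInInstance(SketchBase):
    """Mirrors the suggested fix: tokenizer and model go in the instance hook."""

    def _load_model_metadata(self) -> None:
        self.model_weights = "albert-base-v2"

    def _load_model_instance(self) -> None:
        self.tokenizer = "albert tokenizer"
        self.model = "albert model"


first_attempt = SetInMetadata()
second_attempt = SetInInstance()
print(first_attempt.tokenizer)   # the base hook ran last and replaced the value
print(second_attempt.tokenizer)  # the override ran last, so the value survives
```

In the first class the Albert values are replaced by whatever the base hook assigns; in the second they survive because the override itself is the last thing to run.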

I finally got it written. Thank you for helping!

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from typing import List, Text
import numpy as np
from transformers import TFAlbertModel
from transformers import AlbertTokenizer


class AlbertFeaturizer(LanguageModelFeaturizer):
    """LanguageModelFeaturizer subclass that loads Albert weights and tokenizer."""
    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        # self.model_weights = 'albert-base-v2'
        self.model_weights = 'm3hrdadfi/albert-fa-base-v2'
        self.max_model_sequence_length = 512

    def _load_model_instance(self, skip_model_load: bool) -> None:
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights, from_pt=True)  # from_pt=True loads a PyTorch checkpoint; check whether your weights need it
        self.model.trainable = False
        self.pad_token_id = self.tokenizer.unk_token_id

    def _lm_specific_token_cleanup(self, split_token_ids: List[int], token_strings: List[Text]):
        token_ids_string = list(zip(split_token_ids, token_strings))
        token_ids_string = [(id, string.replace("▁", "")) for id, string in token_ids_string]
        # remove empty strings
        token_ids_string = [(id, string) for id, string in token_ids_string if string]
        # return as individual token ids and token strings
        token_ids, token_strings = zip(*token_ids_string)
        return token_ids, token_strings


    def _add_lm_specific_special_tokens(self, token_ids: List[List[int]]):
        augmented_tokens = []
        for example_token_ids in token_ids:
            example_token_ids.insert(0, 2)  # insert CLS token id (2 in Albert)
            example_token_ids.append(3)  # insert SEP token id (3 in Albert)
            augmented_tokens.append(example_token_ids)
        return augmented_tokens


    def _post_process_sequence_embeddings(self, sequence_embeddings: np.ndarray):
        """Compute sentence and sequence level representations for relevant tokens.
        Args:
            sequence_embeddings: Sequence level dense features received as output from
            language model.
        Returns: Sentence and sequence level representations.
        """
        sentence_embeddings = []
        post_processed_sequence_embeddings = []

        for example_embedding in sequence_embeddings:
            sentence_embeddings.append(example_embedding[0])
            post_processed_sequence_embeddings.append(example_embedding[1:-1])

        return np.array(sentence_embeddings), np.array(post_processed_sequence_embeddings)

Glad to hear it!

Are the models in LanguageModelFeaturizer frozen, so that we just use their embeddings, or are they fine-tuned? I use DIETClassifier after this model.

This is my config file:

language: "fa"  
# with my custom albert
pipeline:
  - name: LanguageModelTokenizer
  - name: "albert.AlbertFeaturizer"
    alias: "bert-embdds"
  - name: CountVectorsFeaturizer
    alias: "one-hot"
  - name: CountVectorsFeaturizer
    alias: "bov"
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 5
  - name: RegexFeaturizer
    case_sensitive: False
  - name: DIETClassifier
    batch_strategy: sequence
    featurizers: ["bert-embdds", "one-hot", "bov"]
    epochs: 200
    learning_rate: 0.002

That should be fine, we don’t do any finetuning of the models out of the box.

Is there any way I can do it? I mean fine-tuning.

For example, by extending the DIET model.

Sorry, now I am not sure whether I understood your question correctly. What would you like to fine-tune? The featurizers? Or DIET?

The featurizers (the embeddings that are fed to DIET).

I wouldn’t recommend fine-tuning those. You likely don’t have enough data to improve the models, and will just end up introducing noise.

@sajjjadayobi I am a beginner with Rasa and wanted to know more about this extension of LanguageModelFeaturizer. I understand the first two methods of AlbertFeaturizer, but I'm not sure why the last three methods (_lm_specific_token_cleanup, _add_lm_specific_special_tokens and _post_process_sequence_embeddings) are used here. Is it really necessary to add these methods?

@fkoerner are those 3 methods necessary for it to run?

Hi @iamsid, yes, you’ll need to extend those methods, otherwise you’ll get a key error here: model_tokens_cleaners[self.model_name](split_token_ids, token_strings) and similar for the pre- and post-processing.
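
To make the key error concrete, here is a small sketch in plain Python. The registry contents and the passthrough_cleanup and cleanup_for helpers are hypothetical; the sketch only assumes that the cleanup function is looked up by model name in a dictionary that contains the built-in model names and nothing else.

```python
from typing import Callable, Dict, List, Tuple

Cleaner = Callable[[List[int], List[str]], Tuple[List[int], List[str]]]


def passthrough_cleanup(
    token_ids: List[int], token_strings: List[str]
) -> Tuple[List[int], List[str]]:
    """Placeholder cleaner that returns its input unchanged."""
    return token_ids, token_strings


# Hypothetical registry: only the built-in model names are present.
model_tokens_cleaners: Dict[str, Cleaner] = {
    name: passthrough_cleanup
    for name in ["bert", "gpt", "gpt2", "xlnet", "distilbert", "roberta"]
}


def cleanup_for(
    model_name: str, token_ids: List[int], token_strings: List[str]
) -> Tuple[List[int], List[str]]:
    """Look up the cleaner by model name, as in the expression quoted above."""
    return model_tokens_cleaners[model_name](token_ids, token_strings)


try:
    cleanup_for("albert", [1], ["▁hello"])
    lookup_failed = False
except KeyError:
    # "albert" is not a key of the registry, so the lookup itself fails.
    lookup_failed = True

print(lookup_failed)
```

Overriding the method in the subclass sidesteps the lookup entirely, which is why a custom model_name needs its own implementation.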

It looks like Albert requires the same special tokens as distilbert and bert, namely [CLS] and [SEP], to be fed into the model (see here); these should be removed again afterwards. It also looks like Albert's tokenizer marks word starts with ▁ characters (rather than the ## continuation prefix that bert uses); these should also be removed.
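
Here is a rough walk-through of those three steps on toy data, in plain Python with lists instead of numpy arrays. The helper names, the token ids 13, 10975 and 15, and the fake model output are made up for illustration; the CLS and SEP ids (2 and 3) are the ones used in the AlbertFeaturizer posted earlier in this thread. Unlike that code, this sketch also guards the case where every token is empty after cleanup.

```python
from typing import List, Tuple

CLS_ID, SEP_ID = 2, 3  # ids taken from the AlbertFeaturizer earlier in the thread


def clean_tokens(
    token_ids: List[int], token_strings: List[str]
) -> Tuple[List[int], List[str]]:
    """Strip the ▁ word-start marker and drop tokens that become empty."""
    pairs = [
        (token_id, text.replace("▁", ""))
        for token_id, text in zip(token_ids, token_strings)
    ]
    pairs = [(token_id, text) for token_id, text in pairs if text]
    if not pairs:
        # zip(*[]) cannot be unpacked into two values, so handle this case here.
        return [], []
    ids, texts = zip(*pairs)
    return list(ids), list(texts)


def add_special_tokens(token_ids: List[int]) -> List[int]:
    """Wrap one example in CLS ... SEP without mutating the input list."""
    return [CLS_ID] + token_ids + [SEP_ID]


def split_embeddings(
    rows: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Use the CLS row as the sentence vector and drop the CLS and SEP rows."""
    return rows[0], rows[1:-1]


# Toy input: a lone "▁" token disappears during cleanup.
ids, texts = clean_tokens([13, 10975, 15], ["▁", "▁hello", "s"])
model_input = add_special_tokens(ids)

# Fake model output: one 2-dimensional vector per input position.
fake_output = [[float(i), float(i)] for i in range(len(model_input))]
sentence_vector, token_vectors = split_embeddings(fake_output)

print(texts, model_input, len(token_vectors))
```

After the last step there is again exactly one vector per cleaned token, which is what lets the token-level features line up with the tokens.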