I need an ALBERT model in LanguageModelFeaturizer

The BERT model is very heavy for my project, so I have to use ALBERT instead of BERT. Do you have any suggestions for me? Or how can I create a custom component for this? Should I extend LanguageModelFeaturizer or HFTransformersNLP?

Hi! If none of the models listed below work for you, you should be able to extend LanguageModelFeaturizer (do not use HFTransformersNLP, as that component has been deprecated) to use a different one. You need to change the following values:

LMFeaturizer.model_weights  # (string)
LMFeaturizer.model  # (Hugging Face class, for example TFAlbertModel)
LMFeaturizer.tokenizer  # (Hugging Face class, for example AlbertTokenizer)
LMFeaturizer.max_model_sequence_length  # (int)

You should figure out the correct value for each, and set them accordingly in the constructor (for max_model_sequence_length) or in the method _load_model_instance (for all other values).

bert
gpt
gpt2
xlnet
distilbert
roberta
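
For reference, if one of these built-ins does work for you, no custom code is needed at all; a config sketch (assuming the Rasa 2.x option names model_name and model_weights):

pipeline:
  - name: LanguageModelFeaturizer
    model_name: "roberta"          # any name from the list above
    model_weights: "roberta-base"  # matching weights on the Hugging Face hub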

If you get stuck feel free to ask a followup!

Is this OK to use in the pipeline instead of LanguageModelFeaturizer?

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import TFAlbertModel
from transformers import AlbertTokenizer

class AlbertFeaturizer(LanguageModelFeaturizer):
    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        self.model_weights = 'albert-base-v2'
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)
        self.max_model_sequence_length = 512
        self.pad_token_id = self.tokenizer.unk_token_id

I think that won’t work for you, because _load_model_instance is called after _load_model_metadata, and the former will overwrite the values set in the latter. You should override _load_model_instance and set self.tokenizer and self.model there instead.

To be clear, you can keep the first part of _load_model_metadata (where you set model_name and model_weights), but you also need to override _load_model_instance and set tokenizer and model there.
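
Something like this (an untested sketch, assuming the stock albert-base-v2 weights; swap in your own values):

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import AlbertTokenizer, TFAlbertModel

class AlbertFeaturizer(LanguageModelFeaturizer):
    def _load_model_metadata(self) -> None:
        # plain values can live here
        self.model_name = 'albert'
        self.model_weights = 'albert-base-v2'
        self.max_model_sequence_length = 512

    def _load_model_instance(self, skip_model_load: bool) -> None:
        # model and tokenizer must be created here: this method runs after
        # _load_model_metadata, and the base implementation would otherwise
        # overwrite anything you set there
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)
        self.pad_token_id = self.tokenizer.pad_token_id  # ALBERT has a dedicated <pad> token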

I finally wrote it. Thank you for helping!

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from typing import List, Text
import numpy as np
from transformers import TFAlbertModel
from transformers import AlbertTokenizer


class AlbertFeaturizer(LanguageModelFeaturizer):
    """doc str"""
    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        # self.model_weights = 'albert-base-v2'
        self.model_weights = 'm3hrdadfi/albert-fa-base-v2'
        self.max_model_sequence_length = 512

    def _load_model_instance(self, skip_model_load: bool) -> None:
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        # from_pt=True loads the PyTorch checkpoint into the TF model class
        self.model = TFAlbertModel.from_pretrained(self.model_weights, from_pt=True)
        self.model.trainable = False  # freeze the transformer weights
        self.pad_token_id = self.tokenizer.unk_token_id

    def _lm_specific_token_cleanup(self, split_token_ids: List[int], token_strings: List[Text]):
        # pair ids with strings, then strip SentencePiece's word-start marker "▁"
        token_ids_string = list(zip(split_token_ids, token_strings))
        token_ids_string = [(id, string.replace("▁", "")) for id, string in token_ids_string]
        # remove empty strings
        token_ids_string = [(id, string) for id, string in token_ids_string if string]
        # return as individual token ids and token strings
        token_ids, token_strings = zip(*token_ids_string)
        return token_ids, token_strings


    def _add_lm_specific_special_tokens(self, token_ids: List[List[int]]):
        augmented_tokens = []
        for example_token_ids in token_ids:
            example_token_ids.insert(0, 2)  # prepend [CLS] token id (2 in ALBERT)
            example_token_ids.append(3)  # append [SEP] token id (3 in ALBERT)
            augmented_tokens.append(example_token_ids)
        return augmented_tokens
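
    # Note: the hard-coded ids 2 and 3 above are ALBERT's usual [CLS]/[SEP] ids;
    # a more robust variant would read them from the tokenizer instead, e.g.
    # self.tokenizer.cls_token_id and self.tokenizer.sep_token_id.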


    def _post_process_sequence_embeddings(self, sequence_embeddings: np.ndarray):
        """Compute sentence and sequence level representations for relevant tokens.
        Args:
            sequence_embeddings: Sequence level dense features received as output from
            language model.
        Returns: Sentence and sequence level representations.
        """
        sentence_embeddings = []
        post_processed_sequence_embeddings = []

        for example_embedding in sequence_embeddings:
            sentence_embeddings.append(example_embedding[0])
            post_processed_sequence_embeddings.append(example_embedding[1:-1])

        return np.array(sentence_embeddings), np.array(post_processed_sequence_embeddings)
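
By the way, the "▁" cleanup above is needed because ALBERT's SentencePiece tokenizer marks word-initial pieces with that character; a quick way to see it (sketch, using the English albert-base-v2 tokenizer):

from transformers import AlbertTokenizer

tok = AlbertTokenizer.from_pretrained('albert-base-v2')
print(tok.tokenize('hello world'))  # e.g. ['▁hello', '▁world'], with '▁' word-start markers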

Glad to hear it! 🙂

Are the models in LanguageModelFeaturizer frozen, so we just use their embeddings? Or are they fine-tuned? I use DIETClassifier after this component.

This is my config file:

language: "fa"  
# with my custom albert
pipeline:
  - name: LanguageModelTokenizer
  - name: "albert.AlbertFeaturizer"
    alias: "bert-embdds"
  - name: CountVectorsFeaturizer
    alias: "one-hot"
  - name: CountVectorsFeaturizer
    alias: "bov"
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 5
  - name: RegexFeaturizer
    case_sensitive: False
  - name: DIETClassifier
    batch_strategy: sequence
    featurizers: ["bert-embdds", "one-hot", "bov"]
    epochs: 200
    learning_rate: 0.002

That should be fine; we don’t do any fine-tuning of the models out of the box.
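
You can check the freeze yourself; a quick standalone sketch (with the English weights, outside of Rasa):

from transformers import TFAlbertModel

model = TFAlbertModel.from_pretrained('albert-base-v2')
model.trainable = False
print(len(model.trainable_weights))  # 0: nothing will receive gradient updates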

Is there any way I can do that? I mean fine-tuning.

For example, by extending the DIET model.

Sorry, I’m not sure whether I understood your question correctly. What would you like to fine-tune? The featurizers? Or DIET?

The featurizers (the embeddings that are fed into DIET).

I wouldn’t recommend fine-tuning those. You likely don’t have enough data to improve the models, and will just end up introducing noise.
