The BERT model is very heavy for my project, so I have to use ALBERT instead of BERT. Do you have any suggestions for me? Or how can I create a custom component for this: should I extend LanguageModelFeaturizer or HFTransformersNLP?
Hi! If none of the models listed below work for you, you should be able to extend LanguageModelFeaturizer (do not use HFTransformersNLP, as that component has been deprecated) to use a different one. You need to change the following values:
LMFeaturizer.model_weights #(string)
LMFeaturizer.model #(hugging face class, for example AlbertModel)
LMFeaturizer.tokenizer #(hugging face class, for example AlbertTokenizer)
LMFeaturizer.max_model_sequence_length #(int)
You should figure out the correct values for each and set these accordingly in the constructor (max_model_sequence_length) or the method _load_model_instance (for all other values); there is a rough sketch after the list of supported models below.
bert
gpt
gpt2
xlnet
distilbert
roberta
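For orientation, here is a minimal, untested sketch of what such a subclass could look like (the class name, the albert-base-v2 weights and the exact values are only illustrative; for simplicity it sets max_model_sequence_length in _load_model_metadata rather than the constructor, which is also what the working example later in this thread does):

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import AlbertTokenizer, TFAlbertModel


class AlbertFeaturizer(LanguageModelFeaturizer):
    """Sketch: LanguageModelFeaturizer that swaps in ALBERT."""

    def _load_model_metadata(self) -> None:
        # Plain metadata: model name, weights, and maximum sequence length.
        self.model_name = "albert"
        self.model_weights = "albert-base-v2"
        self.max_model_sequence_length = 512

    def _load_model_instance(self, skip_model_load: bool) -> None:
        # Tokenizer and model are set here, because this method runs after
        # _load_model_metadata and would otherwise overwrite its values.
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)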
If you get stuck feel free to ask a followup!
Is this OK to use in the pipeline instead of LanguageModelFeaturizer?
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import TFAlbertModel
from transformers import AlbertTokenizer


class AlbertFeaturizer(LanguageModelFeaturizer):
    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        self.model_weights = 'albert-base-v2'
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)
        self.max_model_sequence_length = 512
        self.pad_token_id = self.tokenizer.unk_token_id
I think that won't work for you, because _load_model_instance is called after _load_model_metadata, and the former will overwrite the values set in the latter. You should override _load_model_instance and set self.tokenizer and self.model there instead.
To be clear, you can keep the first parts of _load_model_metadata (where you set model_name and model_weights), but you also need to override _load_model_instance and set tokenizer and model there.
Finally I wrote it, thank you for helping!
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from typing import List, Text
import numpy as np
from transformers import TFAlbertModel
from transformers import AlbertTokenizer


class AlbertFeaturizer(LanguageModelFeaturizer):
    """ALBERT-based LanguageModelFeaturizer."""

    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        # self.model_weights = 'albert-base-v2'
        self.model_weights = 'm3hrdadfi/albert-fa-base-v2'
        self.max_model_sequence_length = 512

    def _load_model_instance(self, skip_model_load: bool) -> None:
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights, from_pt=True)  # check it
        self.model.trainable = False
        self.pad_token_id = self.tokenizer.unk_token_id

    def _lm_specific_token_cleanup(self, split_token_ids: List[int], token_strings: List[Text]):
        token_ids_string = list(zip(split_token_ids, token_strings))
        token_ids_string = [(id, string.replace("▁", "")) for id, string in token_ids_string]
        # remove empty strings
        token_ids_string = [(id, string) for id, string in token_ids_string if string]
        # return as individual token ids and token strings
        token_ids, token_strings = zip(*token_ids_string)
        return token_ids, token_strings

    def _add_lm_specific_special_tokens(self, token_ids: List[List[int]]):
        augmented_tokens = []
        for example_token_ids in token_ids:
            example_token_ids.insert(0, 2)  # insert CLS token id (in Albert)
            example_token_ids.append(3)  # insert SEP token id (in Albert)
            augmented_tokens.append(example_token_ids)
        return augmented_tokens

    def _post_process_sequence_embeddings(self, sequence_embeddings: np.ndarray):
        """Compute sentence and sequence level representations for relevant tokens.

        Args:
            sequence_embeddings: Sequence level dense features received as output from
                language model.

        Returns: Sentence and sequence level representations.
        """
        sentence_embeddings = []
        post_processed_sequence_embeddings = []
        for example_embedding in sequence_embeddings:
            sentence_embeddings.append(example_embedding[0])
            post_processed_sequence_embeddings.append(example_embedding[1:-1])
        return np.array(sentence_embeddings), np.array(post_processed_sequence_embeddings)
Glad to hear it!
Are the models in LanguageModelFeaturizer frozen, so we just use their embeddings, or are they fine-tuned? I use DIETClassifier after this model.
This is my config file:
language: "fa"

# with my custom albert
pipeline:
  - name: LanguageModelTokenizer
  - name: "albert.AlbertFeaturizer"
    alias: "bert-embdds"
  - name: CountVectorsFeaturizer
    alias: "one-hot"
  - name: CountVectorsFeaturizer
    alias: "bov"
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 5
  - name: RegexFeaturizer
    case_sensitive: False
  - name: DIETClassifier
    batch_strategy: sequence
    featurizers: ["bert-embdds", "one-hot", "bov"]
    epochs: 200
    learning_rate: 0.002
That should be fine, we don't do any fine-tuning of the models out of the box.
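If you want to convince yourself of that, here is a small, hypothetical check outside of Rasa, along the lines of the model.trainable = False line in the featurizer above (albert-base-v2 is just an example checkpoint):

from transformers import TFAlbertModel

# With trainable set to False, the Keras model exposes no trainable weights,
# so the ALBERT parameters are only used to produce embeddings and never updated.
model = TFAlbertModel.from_pretrained("albert-base-v2")
model.trainable = False
print(len(model.trainable_weights))  # 0 -> the transformer weights stay frozen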
Is there any way I can do it? I mean fine-tuning,
for example, by extending the DIET model.
Sorry, now I am not sure whether I understood your question correctly. What would you like to fine-tune? The featurizers? Or DIET?
The featurizers (the embeddings that feed into DIET).
I wouldn't recommend fine-tuning those. You likely don't have enough data to improve the models, and will just end up introducing noise.
@sajjjadayobi I am a beginner with Rasa and wanted to know more about this extension of LanguageModelFeaturizer. From the code I understood the first two methods of AlbertFeaturizer, but I am not sure why the last 3 methods, i.e. _lm_specific_token_cleanup, _add_lm_specific_special_tokens and _post_process_sequence_embeddings, are used here. Is it really necessary to add these methods?
@fkoerner are those 3 methods necessary for it to run?
Hi @iamsid, yes, you'll need to extend those methods, otherwise you'll get a key error here:
model_tokens_cleaners[self.model_name](split_token_ids, token_strings) and similar for the pre- and post-processing.
It looks like Albert requires the same tokens as distilbert and bert, like [CLS] and [SEP], to be fed into the model (see here); these should be removed again afterwards. It also looks like Albert's SentencePiece tokenizer adds "▁" marker characters, which should also be removed.
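To make that concrete, a small, hypothetical check with the plain Hugging Face tokenizer (using albert-base-v2 as an example checkpoint) shows the special-token ids and the SentencePiece marker that the overridden methods insert and strip:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
# [CLS] and [SEP] ids match the 2 and 3 inserted in _add_lm_specific_special_tokens
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 2 3
# word-initial pieces carry the "▁" prefix that _lm_specific_token_cleanup removes
print(tokenizer.tokenize("hello world"))               # ['▁hello', '▁world']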
Fun fact! The original DIET paper did this experiment and found that not fine-tuning the BERT features led to the best model for intent prediction. Granted, there's an overlapping margin, but it seems there's evidence that one shouldn't worry too much about fine-tuning the BERT models.
