The BERT model is too heavy for my project, so I have to use ALBERT instead of BERT. Do you have any suggestions for me? How can I create a custom component for this? Should I extend LanguageModelFeaturizer or HFTransformersNLP?
Hi! If none of the models listed below works for you, you should be able to extend LanguageModelFeaturizer (do not use HFTransformersNLP, as that component has been deprecated) to use a different one. You need to change the following values:
LMFeaturizer.model_weights  # (string)
LMFeaturizer.model  # (Hugging Face class, for example TFAlbertModel)
LMFeaturizer.tokenizer  # (Hugging Face class, for example AlbertTokenizer)
LMFeaturizer.max_model_sequence_length  # (int)
You should figure out the correct values for each, and set them accordingly in the constructor (max_model_sequence_length) or in the method _load_model_instance (for all other values); see the sketch after the model list below.
bert
gpt
gpt2
xlnet
distilbert
roberta
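For example, a rough skeleton of such a subclass might look like this (untested, written against the Rasa 2.x API; the ALBERT classes and the weights name are just placeholders for whichever model you pick):

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import AlbertTokenizer, TFAlbertModel

class MyFeaturizer(LanguageModelFeaturizer):
    def _load_model_metadata(self) -> None:
        self.model_name = "albert"
        self.model_weights = "albert-base-v2"
        self.max_model_sequence_length = 512  # maximum input length the model accepts

    def _load_model_instance(self, skip_model_load: bool) -> None:
        # the tokenizer and model belong here, not in _load_model_metadata
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)
        self.pad_token_id = self.tokenizer.unk_token_id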
If you get stuck feel free to ask a followup!
Is this OK to use in the pipeline instead of LanguageModelFeaturizer?
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from transformers import TFAlbertModel
from transformers import AlbertTokenizer

class AlbertFeaturizer(LanguageModelFeaturizer):
    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        self.model_weights = 'albert-base-v2'
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        self.model = TFAlbertModel.from_pretrained(self.model_weights)
        self.max_model_sequence_length = 512
        self.pad_token_id = self.tokenizer.unk_token_id
I think that won't work for you, because _load_model_instance is called after _load_model_metadata, and the former will overwrite the values set in the latter. You should override _load_model_instance and set self.tokenizer and self.model there instead.
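Here is a minimal analog (plain Python, not Rasa code) of why that ordering clobbers your values:

class Base:
    def __init__(self) -> None:
        self._load_model_metadata()
        self._load_model_instance()

    def _load_model_metadata(self) -> None:
        pass

    def _load_model_instance(self) -> None:
        # the base class assigns its own model here, after the metadata step
        self.model = "default-bert"

class Wrong(Base):
    def _load_model_metadata(self) -> None:
        self.model = "albert"  # runs first, then gets overwritten

print(Wrong().model)  # prints "default-bert": the albert value is gone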
To be clear, you can keep the first part of _load_model_metadata (where you set model_name and model_weights), but you'll also need to override _load_model_instance and set tokenizer and model there.
I finally wrote it. Thank you for helping!
from typing import List, Text

import numpy as np
from transformers import TFAlbertModel
from transformers import AlbertTokenizer

from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer

class AlbertFeaturizer(LanguageModelFeaturizer):
    """LanguageModelFeaturizer that loads an ALBERT model instead of BERT."""

    def _load_model_metadata(self) -> None:
        self.model_name = 'albert'
        # self.model_weights = 'albert-base-v2'
        self.model_weights = 'm3hrdadfi/albert-fa-base-v2'
        self.max_model_sequence_length = 512

    def _load_model_instance(self, skip_model_load: bool) -> None:
        self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
        # from_pt=True loads the PyTorch checkpoint into the TF model class
        self.model = TFAlbertModel.from_pretrained(self.model_weights, from_pt=True)
        self.model.trainable = False
        self.pad_token_id = self.tokenizer.unk_token_id

    def _lm_specific_token_cleanup(self, split_token_ids: List[int], token_strings: List[Text]):
        token_ids_string = list(zip(split_token_ids, token_strings))
        # strip the SentencePiece word-start marker "▁"
        token_ids_string = [(id, string.replace("▁", "")) for id, string in token_ids_string]
        # remove empty strings
        token_ids_string = [(id, string) for id, string in token_ids_string if string]
        # return as individual token ids and token strings
        token_ids, token_strings = zip(*token_ids_string)
        return token_ids, token_strings

    def _add_lm_specific_special_tokens(self, token_ids: List[List[int]]):
        augmented_tokens = []
        for example_token_ids in token_ids:
            example_token_ids.insert(0, 2)  # insert the [CLS] token id (2 in ALBERT)
            example_token_ids.append(3)     # append the [SEP] token id (3 in ALBERT)
            augmented_tokens.append(example_token_ids)
        return augmented_tokens

    def _post_process_sequence_embeddings(self, sequence_embeddings: np.ndarray):
        """Compute sentence and sequence level representations for relevant tokens.

        Args:
            sequence_embeddings: Sequence level dense features received as output
                from the language model.

        Returns: Sentence and sequence level representations.
        """
        sentence_embeddings = []
        post_processed_sequence_embeddings = []
        for example_embedding in sequence_embeddings:
            sentence_embeddings.append(example_embedding[0])
            post_processed_sequence_embeddings.append(example_embedding[1:-1])
        return np.array(sentence_embeddings), np.array(post_processed_sequence_embeddings)
Glad to hear it
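One optional hardening you could add (my suggestion, not something settled in this thread): read the special-token ids from the tokenizer instead of hardcoding 2 and 3, so the featurizer stays correct for checkpoints with a different vocabulary layout. A standalone sketch of the idea:

from typing import List
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

def add_special_tokens(token_ids: List[List[int]]) -> List[List[int]]:
    # same job as _add_lm_specific_special_tokens, without hardcoded ids
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    return [[cls_id] + ids + [sep_id] for ids in token_ids]

print(add_special_tokens([[42, 7]]))  # [[2, 42, 7, 3]] with albert-base-v2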
Are the models in LanguageModelFeaturizer frozen, so that we just use their embeddings, or are they fine-tuned? I use DIETClassifier after this model.
This is my config file:
language: "fa"
# with my custom albert
pipeline:
- name: LanguageModelTokenizer
- name: "albert.AlbertFeaturizer"
alias: "bert-embdds"
- name: CountVectorsFeaturizer
alias: "one-hot"
- name: CountVectorsFeaturizer
alias: "bov"
analyzer: char_wb
min_ngram: 2
max_ngram: 5
- name: RegexFeaturizer
case_sensitive: False
- name: DIETClassifier
batch_strategy: sequence
featurizers: ["bert-embdds", "one-hot", "bov"]
epochs: 200
learning_rate: 0.002
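For reference, the "albert.AlbertFeaturizer" name is resolved as a Python import path, so the module has to be importable from where you run rasa; a layout along these lines (hypothetical file names) should work:

project/
├── albert.py      # defines AlbertFeaturizer
├── config.yml
└── data/
    └── nlu.yml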
That should be fine; we don't do any fine-tuning of the models out of the box.
Is there any way I can do that? I mean fine-tuning, for example by extending the DIET model.
Sorry, now I am not sure whether I understood your question correctly. What would you like to fine-tune? The featurizers? Or DIET?
The featurizers (the embeddings that are fed to DIET).
I wouldn't recommend fine-tuning those. You likely don't have enough data to improve the models, and will just end up introducing noise.
@sajjjadayobi I am a beginner with Rasa and wanted to know more about this extension of LanguageModelFeaturizer. From the code portion I understood the first two methods of AlbertFeaturizer, but I am not sure why the last three methods, i.e. _lm_specific_token_cleanup, _add_lm_specific_special_tokens and _post_process_sequence_embeddings, are used here. Is it really necessary to add these methods?
@fkoerner are those 3 methods necessary for it to run?
Hi @iamsid, yes, you'll need to extend those methods, otherwise you'll get a KeyError here:
model_tokens_cleaners[self.model_name](split_token_ids, token_strings)
and similarly for the pre- and post-processing.
It looks like ALBERT requires the same special tokens as distilbert and bert, like [CLS] and [SEP], to be fed into the model (see here); these should be removed again afterwards. It also looks like ALBERT's SentencePiece tokenizer adds ▁ characters, and these should also be removed.
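A quick way to see the ▁ behaviour is to poke at the tokenizer directly (using the public albert-base-v2 weights here; the exact pieces may differ per checkpoint):

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
print(tokenizer.tokenize("playing games"))
# something like ['▁playing', '▁games'] -- the ▁ word-start markers
# that _lm_specific_token_cleanup strips off again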
Fun fact! The original DIET paper did this experiment and found that not fine-tuning the BERT features led to the best model for intent prediction. Granted, there's an overlapping margin, but it seems there's evidence that one shouldn't worry too much about fine-tuning the BERT models.