The BERT model is too heavy for my project, so I have to use ALBERT instead. Do you have any suggestions for me? Or how can I create a custom component for this: should I extend LanguageModelFeaturizer or HFTransformersNLP?
Hi! If none of the models listed below work for you, you should be able to extend
LanguageModelFeaturizer (do not use
HFTransformersNLP, as this component has been deprecated) to use a different one. You need to change the following values:
    LMFeaturizer.model_weights              # (string)
    LMFeaturizer.model                      # (Hugging Face class, for example AlbertModel)
    LMFeaturizer.tokenizer                  # (Hugging Face class, for example AlbertTokenizer)
    LMFeaturizer.max_model_sequence_length  # (int)
You should figure out the correct values for each, and set these accordingly in the constructor (
max_model_sequence_length) or the method
_load_model_instance (for all other values).
bert gpt gpt2 xlnet distilbert roberta
If you get stuck feel free to ask a followup!
is this ok? for using in the pipeline instead of
    from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
    from transformers import TFAlbertModel
    from transformers import AlbertTokenizer


    class AlbertFeaturizer(LanguageModelFeaturizer):

        def _load_model_metadata(self) -> None:
            self.model_name = 'albert'
            self.model_weights = 'albert-base-v2'
            self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
            self.model = TFAlbertModel.from_pretrained(self.model_weights)
            self.max_model_sequence_length = 512
            self.pad_token_id = self.tokenizer.unk_token_id
I think that won’t work for you, because
_load_model_instance is called after
_load_model_metadata, and the former will overwrite the values set in the latter. You should override
_load_model_instance and set
self.model there instead
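To make the ordering issue concrete, here is a minimal toy sketch. It does not use Rasa at all: `BaseFeaturizer` and its default values are hypothetical stand-ins that only mimic the call order described above (metadata first, then the model instance).

```python
class BaseFeaturizer:
    """Toy stand-in for the base class; NOT Rasa's actual implementation."""

    def __init__(self) -> None:
        # Same order as described in the thread: metadata first, instance second.
        self._load_model_metadata()
        self._load_model_instance()

    def _load_model_metadata(self) -> None:
        self.model_weights = "bert-base-uncased"  # hypothetical default

    def _load_model_instance(self) -> None:
        # Runs second, so whatever it assigns to self.model wins.
        self.model = f"model loaded from {self.model_weights}"


class MetadataOnly(BaseFeaturizer):
    """Sets self.model too early; the base method overwrites it afterwards."""

    def _load_model_metadata(self) -> None:
        self.model_weights = "albert-base-v2"
        self.model = "albert model set too early"


class BothOverridden(BaseFeaturizer):
    """Sets self.model in the method that runs last, so it is kept."""

    def _load_model_metadata(self) -> None:
        self.model_weights = "albert-base-v2"

    def _load_model_instance(self) -> None:
        self.model = "albert model set in _load_model_instance"


early = MetadataOnly()
late = BothOverridden()
print(early.model)
print(late.model)
```

In the first subclass the value assigned in `_load_model_metadata` is lost, while the second subclass keeps its own model.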
To be clear, you can keep the first parts of
_load_model_metadata (where you set
model_name, model_weights), but you also need to override
_load_model_instance and set
I finally wrote it, thank you for helping.
    from typing import List, Text

    import numpy as np
    from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
    from transformers import AlbertTokenizer
    from transformers import TFAlbertModel


    class AlbertFeaturizer(LanguageModelFeaturizer):
        """doc str"""

        def _load_model_metadata(self) -> None:
            self.model_name = 'albert'
            # self.model_weights = 'albert-base-v2'
            self.model_weights = 'm3hrdadfi/albert-fa-base-v2'
            self.max_model_sequence_length = 512

        def _load_model_instance(self, skip_model_load: bool) -> None:
            self.tokenizer = AlbertTokenizer.from_pretrained(self.model_weights)
            self.model = TFAlbertModel.from_pretrained(self.model_weights, from_pt=True)  # check it
            self.model.trainable = False
            self.pad_token_id = self.tokenizer.unk_token_id

        def _lm_specific_token_cleanup(self, split_token_ids: List[int], token_strings: List[Text]):
            token_ids_string = list(zip(split_token_ids, token_strings))
            # strip the SentencePiece word-start marker
            token_ids_string = [(id, string.replace("▁", "")) for id, string in token_ids_string]
            # remove empty strings
            token_ids_string = [(id, string) for id, string in token_ids_string if string]
            # return as individual token ids and token strings
            token_ids, token_strings = zip(*token_ids_string)
            return token_ids, token_strings

        def _add_lm_specific_special_tokens(self, token_ids: List[List[int]]):
            augmented_tokens = []
            for example_token_ids in token_ids:
                example_token_ids.insert(0, 2)  # insert CLS token id (in Albert)
                example_token_ids.append(3)  # insert SEP token id (in Albert)
                augmented_tokens.append(example_token_ids)
            return augmented_tokens

        def _post_process_sequence_embeddings(self, sequence_embeddings: np.ndarray):
            """Compute sentence and sequence level representations for relevant tokens.

            Args:
                sequence_embeddings: Sequence level dense features received as output
                    from language model.

            Returns:
                Sentence and sequence level representations.
            """
            sentence_embeddings = []
            post_processed_sequence_embeddings = []
            for example_embedding in sequence_embeddings:
                # position 0 holds the CLS embedding, used as the sentence representation
                sentence_embeddings.append(example_embedding[0])
                # drop the CLS and SEP positions for the per-token embeddings
                post_processed_sequence_embeddings.append(example_embedding[1:-1])
            return np.array(sentence_embeddings), np.array(post_processed_sequence_embeddings)
Glad to hear it
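For anyone reading along, here is a standalone sketch of what the token cleanup step above does. It assumes only the behaviour visible in the posted code (strip the "▁" marker, drop tokens that become empty). The function name `clean_sentencepiece_tokens` and the token ids are made up for illustration. Unlike the posted method, this version also returns empty lists when every token is removed, because unpacking `zip(*[])` into two names raises a ValueError.

```python
from typing import List, Tuple


def clean_sentencepiece_tokens(
    token_ids: List[int], token_strings: List[str]
) -> Tuple[List[int], List[str]]:
    """Strip the "▁" word-start marker and drop tokens that end up empty."""
    cleaned = [
        (token_id, token.replace("▁", ""))
        for token_id, token in zip(token_ids, token_strings)
    ]
    # Remove tokens that consisted only of the marker.
    cleaned = [(token_id, token) for token_id, token in cleaned if token]
    if not cleaned:
        # Guard for the case where nothing is left to unzip.
        return [], []
    ids, tokens = zip(*cleaned)
    return list(ids), list(tokens)


# Illustrative ids only; real ids depend on the tokenizer vocabulary.
ids, tokens = clean_sentencepiece_tokens([13, 101, 102], ["▁", "▁play", "ing"])
print(ids, tokens)
```

The ids stay aligned with their cleaned strings, which is what the featurizer needs when it maps sub-tokens back to the original tokens.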
Are the models in LanguageModelFeaturizer frozen, so that we just use their embeddings, or are they fine-tuned? I use DIETClassifier after this model.
this is my config file:
    language: "fa"
    # with my custom albert
    pipeline:
      - name: LanguageModelTokenizer
      - name: "albert.AlbertFeaturizer"
        alias: "bert-embdds"
      - name: CountVectorsFeaturizer
        alias: "one-hot"
      - name: CountVectorsFeaturizer
        alias: "bov"
        analyzer: char_wb
        min_ngram: 2
        max_ngram: 5
      - name: RegexFeaturizer
        case_sensitive: False
      - name: DIETClassifier
        batch_strategy: sequence
        featurizers: ["bert-embdds", "one-hot", "bov"]
        epochs: 200
        learning_rate: 0.002
That should be fine, we don’t do any finetuning of the models out of the box.
Is there any way I can do it? I mean fine-tuning,
for example by extending the DIET model.
Sorry, now I am not sure whether I understood your question correctly. What would you like to fine-tune? The featurizers? Or DIET?
the featurizers (embeddings that feed to DIET)
I wouldn’t recommend fine-tuning those. You likely don’t have enough data to improve the models, and will just end up introducing noise.
@sajjjadayobi I am a beginner with Rasa and wanted to know more about this extension of LanguageModelFeaturizer. From the code, I understood the first two methods of AlbertFeaturizer, but I am not sure why the last three methods, i.e. _lm_specific_token_cleanup, _add_lm_specific_special_tokens and _post_process_sequence_embeddings, are used here. Is it really necessary to add these methods?
@fkoerner are those 3 methods necessary for it to run?
Hi @iamsid, yes, you’ll need to extend those methods, otherwise you’ll get a key error here:
model_tokens_cleaners[self.model_name](split_token_ids, token_strings) and similar for the pre- and post-processing.
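A toy sketch of why the key error happens, assuming the cleaners are stored in a dictionary keyed by model name as the snippet above suggests. The dictionary contents and the helper functions here are made up for illustration and are not Rasa's actual registry.

```python
from typing import Callable, Dict, List, Tuple

Cleaner = Callable[[List[int], List[str]], Tuple[List[int], List[str]]]


def bert_cleaner(token_ids: List[int], token_strings: List[str]) -> Tuple[List[int], List[str]]:
    """Toy cleaner that strips WordPiece "##" continuation markers."""
    return token_ids, [token.replace("##", "") for token in token_strings]


# Hypothetical registry: only the built-in model names have an entry.
model_tokens_cleaners: Dict[str, Cleaner] = {"bert": bert_cleaner}


def cleanup(model_name: str, token_ids: List[int], token_strings: List[str]):
    # Same lookup pattern as in the snippet above.
    return model_tokens_cleaners[model_name](token_ids, token_strings)


try:
    cleanup("albert", [1], ["▁hi"])
except KeyError as err:
    # "albert" has no entry, so the lookup itself fails.
    missing = err.args[0]
    print(f"no cleaner registered for {missing!r}")
```

Overriding the method in the subclass avoids the lookup entirely, which is why the custom featurizer has to provide its own version.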
It looks like Albert requires the same [CLS] and
[SEP] tokens to be fed into the model (see here), and these should be removed again afterwards. It also looks like Albert may add marker characters to tokens (▁ from its SentencePiece tokenizer, rather than BERT's ##); these should also be removed.
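As a rough sketch of that round trip (add the special tokens before the model, drop their positions afterwards), using plain lists instead of real model output. The function names are made up, and the ids 2 and 3 are taken from the featurizer code posted earlier in the thread. With a Hugging Face tokenizer you could read `tokenizer.cls_token_id` and `tokenizer.sep_token_id` instead of hard-coding them, since the ids may differ between checkpoints.

```python
from typing import List


def add_special_tokens(token_ids: List[List[int]], cls_id: int, sep_id: int) -> List[List[int]]:
    """Wrap each example as [CLS] ... [SEP] without mutating the input lists."""
    return [[cls_id, *example, sep_id] for example in token_ids]


def strip_special_positions(embeddings: List[List[float]]) -> List[List[float]]:
    """Drop the first ([CLS]) and last ([SEP]) position of one example."""
    return embeddings[1:-1]


# Illustrative token ids; 2 and 3 are the CLS/SEP ids used earlier in the thread.
batch = add_special_tokens([[10, 11], [12]], cls_id=2, sep_id=3)
print(batch)

# One fake embedding vector per position: CLS, two real tokens, SEP.
fake_output = [[0.0], [0.1], [0.2], [0.3]]
print(strip_special_positions(fake_output))
```

Building new lists rather than calling `insert` and `append` on the inputs is a design choice for this sketch; it avoids adding the special tokens twice if the same lists are passed in again.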