With Rasa Open Source 1.8, we added support for leveraging language models like BERT, GPT-2, etc. These models can now be used as featurizers inside your NLU pipeline for intent classification, entity recognition and response selection models. The following snippet shows, as an example, how to configure your pipeline to use a BERT model -
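A minimal sketch of such a config, assuming English training data and the bert-base-uncased weights (adjust language and model_weights for your corpus):

language: en
pipeline:
  - name: HFTransformersNLP
    # Language model family to load
    model_name: "bert"
    # Pre-trained weights to use
    model_weights: "bert-base-uncased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 100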
You can load different variants of the same language model using the model_weights parameter, depending on the size of the model and the language of your training corpus. For example, there are Chinese (bert-base-chinese) and Japanese (bert-base-japanese) variants of the BERT model which you can load if your training data is in Chinese or Japanese respectively. A full list of the different variants of these language models is available in the official documentation of the Transformers library.
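For instance, switching the pipeline above to the Japanese variant only means changing the weights; a sketch, assuming the bert-base-japanese shortcut is available in your installed Transformers version:

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    # Japanese pre-trained weights instead of the English ones
    model_weights: "bert-base-japanese"
  # ... rest of the pipeline unchanged (LanguageModelTokenizer, LanguageModelFeaturizer, DIETClassifier)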
Please note that the current implementation uses these language models strictly as featurizers, which means their weights are not fine-tuned during the training of downstream NLU components such as DIETClassifier.
As always, you can still use multiple featurizers in your pipeline, for example -
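A sketch of such a combination, pairing the language-model features with word- and character-level count vectors (the CountVectorsFeaturizer settings shown are the usual example values, not requirements):

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    # Character n-grams in addition to word-level counts
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100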
We would love to hear everyone's feedback on how it performs on your internal datasets, especially when used in combination with the newly introduced DIETClassifier.
Is there a way to load an HF Transformers-compatible model saved in PyTorch format?
Unfortunately, there is no RuBERT model in TF 2.0 format.
When I try to load the PyTorch model, I get this error:
OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5'] found in directory /opt/rubert/conversational_cased_L-12_H-768_A-12_pt/ or `from_pt` set to False
pytorch_model.bin does exist, so I think the issue is from_pt being set to False.
Maybe it would be possible to add an optional parameter that defines whether the loaded model was saved as a PyTorch checkpoint? The docs could then note that setting this parameter to True requires PyTorch to be installed.
I tried it locally and it works - the RuBERT model was loaded. If this is OK, I'll create a pull request.
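For context, the conversion such a parameter would expose already exists in the Transformers library itself; a rough sketch, assuming a BERT-architecture checkpoint and the directory from the error above, with PyTorch installed alongside TensorFlow:

from transformers import TFBertModel

# Load a checkpoint directory that only contains pytorch_model.bin into a TF2 model;
# from_pt=True converts the PyTorch weights on the fly (requires the torch package).
model = TFBertModel.from_pretrained(
    "/opt/rubert/conversational_cased_L-12_H-768_A-12_pt", from_pt=True
)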
If you want to load the model weights from a HuggingFace-compatible model checkpoint stored locally, you can pass its path as the value of the model_weights parameter.
# https://rasa.com/docs/rasa/nlu/components/
language: zh
pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "roberta"
    # Pre-trained weights to be loaded
    model_weights: "data/roberta_chinese_base"
    # An optional path to a specific directory to download and cache the pre-trained model weights.
    # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
    #cache_dir: null
  - name: LanguageModelTokenizer
    # Flag to check whether to split intents
    intent_tokenization_flag: False
    # Symbol on which intent should be split
    intent_split_symbol: "_"
  - name: LanguageModelFeaturizer
  #- name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Model name 'data/roberta_chinese_base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'data/roberta_chinese_base' is a path, a model identifier, or url to a directory containing tokenizer files.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\vocab.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\merges.txt. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\added_tokens.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\special_tokens_map.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\tokenizer_config.json. We won't load it.
OSError: Model name 'data/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'data/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
OSError: Model name 'bert-base-uncased' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). We assumed 'bert-base-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
However, I am not able to run it; I get the errors below:
Traceback (most recent call last):
  File "/Users/malarvizhisaravanan/opt/anaconda3/bin/rasa", line 10, in <module>
    sys.exit(main())
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/main.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 453, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 482, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/train.py", line 75, in train
    trainer = Trainer(nlu_config, component_builder)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/model.py", line 142, in __init__
    components.validate_requirements(cfg.component_names)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/components.py", line 51, in validate_requirements
    component_class = registry.get_component_class(component_name)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/registry.py", line 173, in get_component_class
    return class_from_module_path(component_name)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/utils/common.py", line 196, in class_from_module_path
    if "." in module_path:
TypeError: argument of type 'NoneType' is not iterable
@malarsarav Which version of Rasa are you using? Can you update to the latest version in a new virtual env, and open a separate forum issue if the problem persists? Thanks!
I am a newbie to Rasa. Could someone please help me understand whether I can use the above pipeline to create a Japanese-language chatbot?
Also, is it possible to do this using the spaCy pipeline provided in the Rasa docs together with a language customization?
Which of the two is recommended?
Hey @koaning, I've just gotten started with Rasa and have gone through the basics. I have to mention that your videos and Rasa NLU examples were quite helpful. As I understand it, Rasa currently doesn't support PyTorch-based language models. I want to know whether it's possible to create a custom pipeline component that does the language-model work in a Python script - i.e., wrap a PyTorch-based language model in a Python script and put that component at the top of the pipeline (instead of HFTransformersNLP), just like we can add custom components for sentiment analysis, for example. Sorry if the question is not that clear.
I'm using Rasa 2.2.8, by the way. Is this possible even in newer versions? I am currently doing a research project on fine-tuning an XLM-R based language model for Sinhalese using PyTorch. It would be nice if I could add it as a custom component, since Rasa doesn't support that out of the box.
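To picture what that could look like, here is a rough sketch of such a custom component for Rasa 2.x: it wraps a local PyTorch XLM-R checkpoint (the model_path value is illustrative) and attaches a mean-pooled sentence vector as a dense feature. The import paths for Features and the feature-type constant are the Rasa 2.x ones but can differ between versions, so treat this as a starting point rather than a drop-in component.

from typing import Any, Dict, Optional, Text

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.shared.nlu.constants import TEXT, FEATURE_TYPE_SENTENCE
from rasa.shared.nlu.training_data.features import Features
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData


class XLMRFeaturizer(Component):
    """Adds a sentence embedding from a local PyTorch XLM-R model as a dense feature."""

    # Illustrative default; point this at your fine-tuned checkpoint directory.
    defaults = {"model_path": "models/xlmr-sinhala"}

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        path = self.component_config["model_path"]
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModel.from_pretrained(path)
        self.model.eval()

    def _sentence_embedding(self, text: Text) -> np.ndarray:
        # Mean-pool the last hidden states into a single vector of shape (1, hidden_size).
        input_ids = self.tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            hidden_states = self.model(input_ids)[0]
        return hidden_states.mean(dim=1).numpy()

    def _add_features(self, message: Message) -> None:
        text = message.get(TEXT)
        if not text:
            return
        message.add_features(
            Features(self._sentence_embedding(text), FEATURE_TYPE_SENTENCE, TEXT, self.name)
        )

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        for example in training_data.training_examples:
            self._add_features(example)

    def process(self, message: Message, **kwargs: Any) -> None:
        self._add_features(message)

In config.yml the component would then be referenced by its module path (e.g. custom_components.XLMRFeaturizer, assuming that file name), placed before DIETClassifier alongside a regular tokenizer such as WhitespaceTokenizer.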