Maybe it would be possible to add an optional parameter indicating whether the loaded model was saved as a PyTorch checkpoint? The docs could then note that setting this param to True requires a PyTorch installation. I tried it locally and it works: the RuBERT model was loaded. If this is OK, I'll create a pull request.
@ezhvsalate You can also convert the PyTorch checkpoint into a compatible TensorFlow checkpoint using this script and then load the model: transformers/convert_pytorch_checkpoint_to_tf2.py at master · huggingface/transformers · GitHub
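As a rough sketch of how that conversion script is invoked (the flag names below match the script's argparse interface in transformers 2.x/3.x, but double-check against your installed version; all paths are placeholders):

```shell
# Convert a local PyTorch BERT checkpoint into a TF2 checkpoint
# that the TensorFlow side of transformers (and hence Rasa) can load.
# Placeholder paths -- point these at your own checkpoint directory.
python convert_pytorch_checkpoint_to_tf2.py \
  --model_type bert \
  --pytorch_checkpoint_path ./rubert/pytorch_model.bin \
  --config_file ./rubert/config.json \
  --tf_dump_path ./rubert-tf/
```

The resulting directory can then be passed as `model_weights` in the Rasa pipeline config, as shown further down in this thread.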
How can I load different variants of the same language model using the parameter? Can you specify the process of doing so in the pipeline?
For example, if the variant you want to use is bert-base-uncased, then your pipeline would look something like:
```yaml
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-uncased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
```
If you want to load the model weights from a huggingface-compatible model checkpoint stored locally, you can pass its path as the value of the
@dakshvar22 For loading a local model, which parameter should I use?
Path to the directory containing the model checkpoint.
```yaml
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "path/to/your/model"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
```
Can someone take a look at this one? Training with BERT keeps failing.
```yaml
# https://rasa.com/docs/rasa/nlu/components/
language: zh

pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "roberta"
    # Pre-trained weights to be loaded
    model_weights: "data/roberta_chinese_base"
    # An optional path to a specific directory to download and cache the pre-trained model weights.
    # The default cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory
    #cache_dir: null
  - name: LanguageModelTokenizer
    # Flag to check whether to split intents
    intent_tokenization_flag: False
    # Symbol on which intent should be split
    intent_split_symbol: "_"
  - name: LanguageModelFeaturizer
  #- name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
```
```
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Model name 'data/roberta_chinese_base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'data/roberta_chinese_base' is a path, a model identifier, or url to a directory containing tokenizer files.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\vocab.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\merges.txt. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\added_tokens.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\special_tokens_map.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\tokenizer_config.json. We won't load it.

OSError: Model name 'data/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'data/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
```
```
OSError: Model name 'bert-base-uncased' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). We assumed 'bert-base-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
```
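Errors like these usually mean the local model directory is missing the vocabulary files the tokenizer expects. As a quick sanity check before pointing Rasa at a directory, you can verify the files exist (a minimal sketch; the per-family file lists below are taken from the error messages above, not from an exhaustive spec):

```python
import os

# Vocabulary files each tokenizer family looks for, per the OSError messages above.
REQUIRED_FILES = {
    "roberta": ["vocab.json", "merges.txt"],
    "bert": ["vocab.txt"],
}

def missing_tokenizer_files(model_dir, model_name):
    """Return the required vocabulary files that are absent from model_dir."""
    required = REQUIRED_FILES[model_name]
    return [f for f in required if not os.path.isfile(os.path.join(model_dir, f))]

if __name__ == "__main__":
    gaps = missing_tokenizer_files("data/roberta_chinese_base", "roberta")
    if gaps:
        print("Missing files:", gaps)
```

If any files are reported missing, re-download or re-save the tokenizer into that directory before training.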
I'm trying to run the config below in Rasa NLU:

```yaml
language: en
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-uncased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 200
```

However, I'm not able to run it; I'm getting the errors below.
```
Traceback (most recent call last):
  File "/Users/malarvizhisaravanan/opt/anaconda3/bin/rasa", line 10, in <module>
    sys.exit(main())
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 453, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 482, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/train.py", line 75, in train
    trainer = Trainer(nlu_config, component_builder)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/model.py", line 142, in __init__
    components.validate_requirements(cfg.component_names)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/components.py", line 51, in validate_requirements
    component_class = registry.get_component_class(component_name)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/registry.py", line 173, in get_component_class
    return class_from_module_path(component_name)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/utils/common.py", line 196, in class_from_module_path
    if "." in module_path:
TypeError: argument of type 'NoneType' is not iterable
```
@dakshvar22 - any suggestions on this ?
@malarsarav Which version of Rasa are you using? Can you update to the latest version in a new virtual env, and open a separate forum issue if the problem persists? Thanks!
Did anyone solve the "bert-base-uncased not found" error?
I am a newbie to Rasa. Could someone please help me understand whether I can use the above pipeline to create a Japanese-language chatbot? Also, is it possible to do this with the spaCy pipeline from the Rasa docs plus a language customization? Which of the two is recommended?
Hey @koaning, I've just gotten started with Rasa and have gone through the basics. I have to mention that your videos and the rasa-nlu-examples were quite helpful. As I understand it, Rasa currently doesn't support PyTorch-based language models. I want to know if it's possible to create a custom pipeline component that handles the language-model work in a Python script, by adding a PyTorch-based language model to that script and putting it at the top of the pipeline (without using HFTransformersNLP at the top), just like we can add custom components for, say, sentiment analysis. Sorry if the question isn't clear.
I'm using Rasa 2.2.8, by the way. Is it possible in newer versions too? I am currently doing a research project on fine-tuning an XLM-R-based language model for Sinhalese using PyTorch. It would be nice if I could add it as a custom component, since Rasa doesn't support that out of the box.
Oh yeah, it's totally possible to write your own models. In fact, there are plenty of examples over at rasa-nlu-examples, including some custom classifiers and featurizers. Note, though, that right now we're transitioning these components to Rasa 3.x. The latest release for Rasa 2.x is found here.
A few caveats though.
- Huggingface featurizers are already natively supported, via LanguageModelFeaturizer.
- Usually, you should delay custom components. Typically the most pressing thing when you’re building an assistant is the data that you’re learning on. The DIET architecture is pretty good at picking up many patterns from many languages and I wouldn’t worry too much about an optimal pipeline unless you have a large representative dataset.
@koaning If I want to attach xlm-roberta-base to the pipeline via LanguageModelFeaturizer, is that possible? If so, can you please explain a bit how to do it? I'm sorry, but in the documentation I was only able to find bert, gpt, gpt2, xlnet, distilbert, and roberta based models, which is why I had to ask. (If I want to add the xlm-roberta-base model, what should model_name and model_weights be? No defaults are given for xlm-roberta-base in the Rasa documentation.)
… and thank you very much for all the info. That helps a lot.
Just to confirm, in the huggingface section of the Non-English NLU blogpost there’s this snippet.
```yaml
- name: LanguageModelFeaturizer
  model_name: bert
  model_weights: asafaya/bert-base-arabic
```
The idea is that a bert-kind of huggingface model can be used in Rasa, but you'll need to give it appropriate weights. Am I understanding correctly that xlm-roberta-base refers to a non-roberta model?
It'd help if you could share the config.yml file that you tried to run.
@koaning, adding bert-based models works just fine. I've tried it with the following config.
```yaml
language: si

pipeline:
  - name: "HFTransformersNLP"
    model_name: "roberta"
    model_weights: "keshan/SinhalaBERTo"
    cache_dir: "hf_lm_weights/bert_si"
  - name: "LanguageModelTokenizer"
  - name: "LanguageModelFeaturizer"
  - name: "LexicalSyntacticFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "CountVectorsFeaturizer"
    analyzer: "char"
    min_ngram: 3
    max_ngram: 5
  - name: "DIETClassifier"
    entity_recognition: true
    epochs: 300
  - name: "EntitySynonymMapper"
  - name: "ResponseSelector"
    epochs: 300
    retrieval_intent: faq

policies:
  - name: RulePolicy
```
My question is: is it possible to attach the xlm-roberta-base model in the same way? If I want to add it to the pipeline via LanguageModelFeaturizer, how do I specify model_weights? That's where I'm stuck, because I couldn't find those parameters in the documentation for xlm-roberta based models.
Off the top of my head: xlm-roberta-base would refer to the weights, and the architecture/model_name would be
Right! I'll see if that works. I thought they were different. @koaning Thanks a lot for the help.