Support for Language Models inside Rasa

Would it be possible to add an optional parameter indicating whether the loaded model was saved as a PyTorch checkpoint? The docs could then note that setting this parameter to True requires PyTorch to be installed. I tried it locally and it works: the RuBERT model was loaded. If this sounds OK, I’ll create a pull request.

@ezhvsalate You can also convert the PyTorch checkpoint into a compatible TensorFlow checkpoint using this script and then load the model: transformers/convert_pytorch_checkpoint_to_tf2.py at master · huggingface/transformers · GitHub
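Alternatively, the conversion can be done in a couple of lines of Python, since `from_pretrained` accepts `from_pt=True`. This is only a sketch: it assumes `transformers` (with TensorFlow) is installed, and both directory paths are placeholders you would replace with your own.

```python
def convert_pt_checkpoint_to_tf(pt_dir: str, tf_dir: str) -> None:
    """Re-save a local PyTorch BERT checkpoint as a TensorFlow one."""
    # Requires `transformers` with TensorFlow installed.
    from transformers import TFBertModel

    # from_pt=True tells transformers to load the PyTorch weights
    # and convert them to the TensorFlow model class on the fly.
    model = TFBertModel.from_pretrained(pt_dir, from_pt=True)

    # Writes a TF-compatible checkpoint (tf_model.h5 plus config).
    model.save_pretrained(tf_dir)
```

After conversion, `model_weights` in the pipeline can point at the output directory.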

3 Likes

How can I load different variants of the same language model using the model_weights parameter?

Can you describe how to do this in the pipeline?

@dakshvar22

For example, if the variant you want to use is bert-base-uncased, then your pipeline would look something like this:

pipeline:
   - name: HFTransformersNLP
     model_name: "bert"
     model_weights: "bert-base-uncased"
   - name: LanguageModelTokenizer
   - name: LanguageModelFeaturizer
   - name: DIETClassifier

If you want to load the model weights from a Hugging Face-compatible model checkpoint stored locally, you can pass its path as the value of the model_weights parameter.

1 Like

@dakshvar22 For loading a local model, which parameter should I use?

The path to the directory containing the model checkpoint:

pipeline:
   - name: HFTransformersNLP
     model_name: "bert"
     model_weights: "path/to/your/model"
   - name: LanguageModelTokenizer
   - name: LanguageModelFeaturizer
   - name: DIETClassifier
1 Like

Can someone take a look at this one? Training with BERT is constantly failing.

https://forum.rasa.com/t/uising-bert-with-rasa/28113

# https://rasa.com/docs/rasa/nlu/components/
language: zh
pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "roberta"
    # Pre-Trained weights to be loaded
    model_weights: "data/roberta_chinese_base"
    # An optional path to a specific directory to download and cache the pre-trained model weights.
    # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
    #cache_dir: null
  - name: LanguageModelTokenizer
    # Flag to check whether to split intents
    intent_tokenization_flag: False
    # Symbol on which intent should be split
    intent_split_symbol: "_"
  - name: LanguageModelFeaturizer
  #- name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy

2020-05-30 16:02:01 INFO transformers.tokenization_utils - Model name 'data/roberta_chinese_base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'data/roberta_chinese_base' is a path, a model identifier, or url to a directory containing tokenizer files.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\vocab.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\merges.txt. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\added_tokens.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\special_tokens_map.json. We won't load it.
2020-05-30 16:02:01 INFO transformers.tokenization_utils - Didn't find file data/roberta_chinese_base\tokenizer_config.json. We won't load it.

OSError: Model name 'data/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'data/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

OSError: Model name 'bert-base-uncased' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). We assumed 'bert-base-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
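Both errors above point at the same cause: when model_weights is a local directory, the transformers tokenizer looks for its vocabulary files there. For model_name: "roberta" that means roughly this layout (the vocabulary file names are taken from the error message; the weights file name is an assumption, depending on whether you have TF or PyTorch weights):

```
data/roberta_chinese_base/
    config.json   # model configuration
    vocab.json    # RoBERTa tokenizer vocabulary
    merges.txt    # BPE merges
    tf_model.h5   # TF weights (or a PyTorch checkpoint converted to TF)
```

For model_name: "bert" the tokenizer looks for vocab.txt instead. Note that some Chinese "RoBERTa" checkpoints are saved in BERT format with a vocab.txt, in which case the roberta tokenizer will not find vocab.json/merges.txt, which would produce exactly the first error shown.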

I’m trying to run the config below in Rasa NLU.

language: en

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-uncased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 200

However, I’m not able to run it; I’m getting the errors below.

Traceback (most recent call last):
  File "/Users/malarvizhisaravanan/opt/anaconda3/bin/rasa", line 10, in <module>
    sys.exit(main())
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 453, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/train.py", line 482, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/train.py", line 75, in train
    trainer = Trainer(nlu_config, component_builder)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/model.py", line 142, in __init__
    components.validate_requirements(cfg.component_names)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/components.py", line 51, in validate_requirements
    component_class = registry.get_component_class(component_name)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/nlu/registry.py", line 173, in get_component_class
    return class_from_module_path(component_name)
  File "/Users/malarvizhisaravanan/opt/anaconda3/lib/python3.7/site-packages/rasa/utils/common.py", line 196, in class_from_module_path
    if "." in module_path:
TypeError: argument of type 'NoneType' is not iterable

@dakshvar22 - any suggestions on this ?

@malarsarav Which version of Rasa are you using? Can you update to the latest version in a new virtual env, and open a separate forum thread if the problem persists? Thanks.

Did anyone solve the bert-base-uncased not-found error?

I am a newbie to Rasa. Could someone please help me understand whether I can use the above pipeline to create a Japanese-language chatbot? Also, is it possible to do this with the spaCy pipeline described in the Rasa docs plus a language customization? Which of the two is recommended?

Hey @koaning, I’ve just gotten started with Rasa and have gone through the basics. I have to mention that your videos and the rasa NLU examples were quite helpful. As I understand it, Rasa currently doesn’t support PyTorch-based language models. I want to know whether it’s possible to create a custom pipeline component that does the language-model work in a Python script: that is, add a PyTorch-based language model to the script and put that component at the top of the pipeline (instead of HFTransformersNLP), just as we can add custom components for sentiment analysis, for example. Sorry if the question isn’t clear.

I’m using Rasa 2.2.8, by the way. Is it possible in newer versions too? I am currently doing a research project on fine-tuning an XLM-R-based language model for Sinhalese using PyTorch. It would be nice if I could add it as a custom component, since Rasa doesn’t support that out of the box.

Oh yeah, it’s totally possible to write your own models. In fact, there are plenty of examples over at rasa-nlu-examples, including custom classifiers and featurizers. Note, though, that we’re currently transitioning these components to Rasa 3.x. The latest release for Rasa 2.x is found here.

A few caveats though.

  1. Hugging Face featurizers are already supported natively, via LanguageModelFeaturizer.
  2. Usually, you should delay custom components. Typically the most pressing thing when you’re building an assistant is the data that you’re learning on. The DIET architecture is pretty good at picking up many patterns from many languages and I wouldn’t worry too much about an optimal pipeline unless you have a large representative dataset.
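For completeness, a custom component is referenced in config.yml by its module path rather than a built-in name. A sketch (sentiment.SentimentAnalyzer is a hypothetical component class in a sentiment.py module on your PYTHONPATH):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  # Hypothetical custom component, resolved by its module path:
  - name: sentiment.SentimentAnalyzer
  - name: DIETClassifier
    epochs: 100
```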
1 Like

@koaning Is it possible to attach xlm-roberta-base to the pipeline via LanguageModelFeaturizer? If so, can you please explain a bit how I can do that? I’m sorry, but in the documentation I was only able to find bert, gpt, gpt2, xlnet, distilbert, and roberta based models, which is why I had to ask. (If I want to add the xlm-roberta-base model, what should model_name and model_weights be, given that the Rasa documentation lists no defaults for xlm-roberta-base?)

… and thank you very much for all the info. That helps a lot.

Just to confirm, in the huggingface section of the Non-English NLU blogpost there’s this snippet.

- name: LanguageModelFeaturizer
  model_name: bert
  model_weights: asafaya/bert-base-arabic

The idea is that a BERT-kind of Hugging Face model can be used in Rasa, but you’ll need to give it appropriate weights. Am I understanding correctly that xlm-roberta-base refers to a non-RoBERTa model?

It’d help if you could share the config.yml file that you tried to run.

1 Like

@koaning, Adding BERT-based models works just fine. I’ve tried it with the following config:

language: si

pipeline:
  - name: "HFTransformersNLP"
    model_name: "roberta"
    model_weights: "keshan/SinhalaBERTo"
    cache_dir: "hf_lm_weights/bert_si"
  - name: "LanguageModelTokenizer"
  - name: "LanguageModelFeaturizer"
  - name: "LexicalSyntacticFeaturizer"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "CountVectorsFeaturizer"
    analyzer: "char"
    min_ngram: 3
    max_ngram: 5
  - name: "DIETClassifier"
    entity_recognition: true
    epochs: 300
  - name: "EntitySynonymMapper"
  - name: "ResponseSelector"
    epochs: 300
    retrieval_intent: faq

policies:
  - name: RulePolicy

My question is: is it possible to attach the xlm-roberta-base model in the same way? If I want to add it to the pipeline via LanguageModelFeaturizer, how do I specify model_name and model_weights? That’s where I’m stuck, because I couldn’t find those parameters in the documentation for xlm-roberta-based models.

Off the top of my head: xlm-roberta-base would refer to the weights, and the architecture/model_name would be roberta.
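Translated into the config style used earlier in this thread, that suggestion would look something like this. An untested sketch: whether the roberta loader accepts the XLM-R tokenizer files is worth verifying on a small training run.

```yaml
pipeline:
  - name: "HFTransformersNLP"
    model_name: "roberta"
    model_weights: "xlm-roberta-base"
  - name: "LanguageModelTokenizer"
  - name: "LanguageModelFeaturizer"
  - name: "DIETClassifier"
    epochs: 300
```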

1 Like

Right! I’ll see if that works. I thought they were different. @koaning Thanks a lot for the help.

1 Like