Recommended AWS setup for speeding up training

Is there a recommended AWS setup (instances and environment) for speeding up training? In this thread, I see that GPUs aren’t required. Then, in this thread, I see that people report GPUs speeding up training a ton. So I think I definitely want a GPU instance. Which instance should I use, though? Or which set of instances should I test out to get a sense of the tradeoff between training speed and cost for my setting? I should be using EC2, right? Or SageMaker or something else?

Surely, Rasa would want to provide the option to host training as part of their business model like Hugging Face, no?

> Surely, Rasa would want to provide the option to host training as part of their business model like Hugging Face, no?

Many of our users/clients actually prefer to run training on their own hardware and don’t like the idea of sending their data to a third party. I don’t know what Rasa may or may not do in the future, but I can confirm that many folks don’t need this use case.

In terms of GPUs, I haven’t felt the need for one. If you’re getting started, your main concern is usually collecting a representative dataset; otherwise your validation statistics might not reflect the behavior of your users.

How big is your dataset now? Do you really need a GPU? Rasa definitely supports GPUs for training the TensorFlow models that we have, but it’s usually not the primary concern of folks.
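As a quick sanity check before paying for a GPU instance, you can ask TensorFlow (the backend Rasa’s models train on) whether it actually sees a GPU. A minimal sketch, which falls back gracefully if TensorFlow isn’t installed in the current environment:

```python
# Check whether TensorFlow can see a GPU on this machine.
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
except ImportError:
    gpus = []

print(f"GPUs visible to TensorFlow: {len(gpus)}")
```

If this prints `0` on a GPU instance, the CUDA driver or TensorFlow build is usually the culprit, not Rasa itself.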

Gotcha :+1:

> Do you really need a GPU?

I’ve read this advice in several threads, but this framing isn’t really how I think about the problem. The answer to “do I need a GPU to train the model?” is “no”, but the answer to “is it worth it to have a GPU?” is “yes”. Here’s how I think about it: do I value the time in my development cycle that I save by switching to a GPU more than the monetary cost I pay AWS? For my case, the answer to this type of time-saving question is nearly always “yes”. I’m guessing this is why, at many good deep learning research institutes/companies, they make it very easy to access the GPU cluster.
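That time-vs-money tradeoff can be made concrete with back-of-the-envelope arithmetic (the numbers below are hypothetical illustrations; plug in your own instance pricing and hourly rate):

```python
def gpu_worth_it(cpu_train_hours: float,
                 gpu_train_hours: float,
                 gpu_cost_per_hour: float,
                 dev_value_per_hour: float) -> bool:
    """Return True if the developer time saved is worth more than the GPU bill."""
    hours_saved = cpu_train_hours - gpu_train_hours
    value_saved = hours_saved * dev_value_per_hour
    gpu_bill = gpu_train_hours * gpu_cost_per_hour
    return value_saved > gpu_bill

# Hypothetical example: 2h on CPU vs 0.5h on a $1.20/h GPU instance,
# with developer time valued at $50/h.
print(gpu_worth_it(cpu_train_hours=2.0, gpu_train_hours=0.5,
                   gpu_cost_per_hour=1.20, dev_value_per_hour=50.0))  # → True
```

With these example numbers, 1.5 hours saved ($75 of time) dwarfs the $0.60 instance bill, which is the intuition behind “nearly always yes”.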

It’s fair that GPUs make things faster, but I might consider a few downsides.

  1. Although DIET/TED benefit from a speedup, the other components do not. The featurisation pipeline typically uses scikit-learn under the hood, which is CPU-bound. This puts a cap on the expected speedup.
  2. By running on CPU only, you can use GitHub Actions in your CI/CD pipeline to generate a model as an artifact. Sure, you could hook up a custom runner with a GPU, but that adds overhead.
  3. In my experience, setting up GPU instances that scale on and off exactly when you want, without ongoing maintenance, isn’t trivial. You may cut training time but increase development time maintaining yet another AWS component.
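Point 1 is essentially Amdahl’s law: only the fraction of the pipeline that runs on the GPU can be accelerated. A tiny sketch (the fractions are hypothetical; measure your own pipeline):

```python
def overall_speedup(gpu_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: the CPU-bound featurisation caps the total speedup."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

# Even if model training (say 60% of wall time) runs 10x faster on a GPU,
# the whole `rasa train` run speeds up far less than 10x:
print(round(overall_speedup(gpu_fraction=0.6, gpu_speedup=10.0), 2))  # → 2.17
```

So a “10x faster on GPU” benchmark for DIET/TED translates into a much smaller end-to-end win once the CPU-bound featurisation is accounted for.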

I should admit: I’m pretty old-school in terms of tooling, and part of what I’m mentioning is preference for sure … but I also collaborate with the research team, and even there I have personally never really felt the need to invest in a GPU.

That said, @dakshvar22 would you have any advice for running Rasa on AWS with a GPU?

@bradyneal You can use any of the g- or p-series EC2 instances. We are planning to release a GPU-supported Docker image soon, which should simplify installing Rasa on GPU machines.
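Once a g- or p-series instance is running, it’s worth confirming the NVIDIA driver is set up before installing Rasa. A small sketch, assuming you launched something like a Deep Learning AMI (which ships with the driver):

```python
import shutil
import subprocess

def nvidia_driver_ready() -> bool:
    """True if nvidia-smi is on PATH and the driver responds."""
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

print("NVIDIA driver ready:", nvidia_driver_ready())
```

If this prints `False`, TensorFlow won’t see the GPU either, so it’s a cheaper first check than a full training run.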


Any rough idea when you plan to release the GPU-supported Docker image?

There is no one-size-fits-all answer here: the best AWS setup for speeding up training depends on your specific needs and requirements. That said, a couple of general tips: use a GPU instance for training, as this can often speed up training significantly, and use EC2 instances, which give you more control over the training process. Ultimately, the best way to determine which AWS setup is right for you is to experiment with different configurations and see what works best for your particular situation.