Training crashed a server

Version: 53425e4-CC

Hello. We have Rasa X 0.25.2 and Rasa 1.7.0 installed locally (i.e. via pip) on an AWS EC2 server.

When I press the Train button, training starts, and after some time the button turns purple again. However, no new model appears on the Models page.

What we figured out is that once the training has actually finished (the UI gives no indication of this), pressing the Train button again shows the message that training is done, and the model appears on the Models page.

However, if you press Train a second time before the first training has actually finished, the server stops responding: according to EC2 monitoring it stays under constant heavy load, and you can’t reach it via SSH or Rasa X. The only way to recover is to shut the server down and start it again.

Also, if you train manually on the server, it works fine and training finishes in around 5 minutes.
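As a stopgap while training manually, the double-start problem described above can be avoided by wrapping the training command in a lock so that a second run refuses to start while the first is still going. This is only a minimal sketch: the names `exclusive_lock`, `run_exclusive` and `AlreadyRunning` and the lock file path are made up for illustration, it relies on `fcntl` and so only works on Unix-like systems, and it guards manual runs only, not the Train button in Rasa X.

```python
import fcntl
import subprocess
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another run already holds the lock (hypothetical name)."""


@contextmanager
def exclusive_lock(lock_path):
    # Open (or create) the lock file; the lock lives as long as this handle.
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fail immediately if it is taken.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            raise AlreadyRunning(f"{lock_path} is locked by another run")
        yield
    finally:
        # Closing the file releases the lock.
        handle.close()


def run_exclusive(command, lock_path):
    """Run `command` only if no other run holds `lock_path`; return its exit code."""
    with exclusive_lock(lock_path):
        return subprocess.run(command).returncode
```

With this in place, a manual training run could be started as `run_exclusive(["rasa", "train"], "/tmp/rasa-train.lock")` (the lock path is an arbitrary choice). A second call made while the first is still running raises `AlreadyRunning` instead of launching another training process on the same machine.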

I guess it is a bug. What do you think the problem might be, and has anyone else run into this? Our bot is quite big, I suppose.

I have noticed in past versions that logging out and logging back in again might help sometimes.

Also, for context, what type of EC2 machine are you running (one with multiple CPUs?) and how large is the training data?

I’m running a t2.medium: 2 vCPUs, 4 GB RAM. I don’t see how having multiple CPUs or a single CPU would affect the server (if you mean that a single CPU is pinned at 100% by training, that would be very strange, since the OS restricts that). The training data is not very large, 8 Kb to be precise. I wonder how logging out and logging back in might help. Is that meant for after the server has crashed, or before? If after, I can’t do that, since I’m unable to reach the EC2 instance or the Rasa X server.

I work at Rasa (I’m their Research Advocate) so I tend to use the bleeding-edge version of Rasa X. I recall that in the past there was a bug (can’t recall the version, unfortunately) where some of the UI elements would hiccup. Logging out and then logging back in was a remedy.

Instead of 8Kb … how many different intents/entities? How many stories? Also … could I see your model pipeline? Are you doing things like pretrained BERT? 4 GB of RAM isn’t a whole lot if there are big models at play.
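To check whether memory really is the bottleneck, one option is to watch how much RAM is still available on the instance while a training run is in progress. The sketch below parses the text of `/proc/meminfo`; it assumes a Linux host whose kernel reports the `MemAvailable` field, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name:   12345 kB" (the unit is optional).
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Fraction of total memory the kernel estimates is still available."""
    return fields["MemAvailable"] / fields["MemTotal"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory statistics (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling `available_fraction(read_meminfo())` every few seconds from a second SSH session during training would show whether available memory drops towards zero before the machine stops responding.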

@gsp0din not sure if you’ve solved this since, but you say you installed Rasa X via pip? We generally don’t recommend that - have you looked at the options for deployment with e.g. docker compose?

Hello! We are now trying to move to automated deployment, and I’m trying out the one-line Kubernetes deployment. It works as expected, but I’m new to containers, so I have a couple of questions. First, is there a way to access files, for example the domain, directly? Second, how do I start and stop Rasa X and Rasa, and how do I specify which ports Rasa should run on? Also, can I set RASA_X_PASSWORD directly, or is initial_user_password the only way? More generally, can I access Rasa as a server? And what is the best practice for running the action server? I had my file stored in the Rasa project folder, but now I can’t access it (or don’t have one), so should I install Rasa manually again and then start the action server with rasa run actions?

Maybe this will help: deploying Rasa X
It’s a Rasa Masterclass video about deploying Rasa X in a cluster environment.
