Training crashed a server

Version: 53425e4-CC

Hello. We have Rasa X version 0.25.2 and Rasa version 1.7.0 installed locally (via pip) on an AWS EC2 server.

When I press the Train button, training starts, and after some time the button turns purple again. However, no new model appears on the Models page.

What we figured out is that once training has actually finished (there is no indication of this in the UI), pressing the Train button again shows a "training is done" message and the model appears on the Models page.

However, if you press Train a second time before the first training has actually finished, the server stops responding: according to EC2 monitoring it stays under constant heavy load, and you can no longer reach it via SSH or Rasa X. The only way to recover is to shut the server down and start it again.

Also, if you train manually from the server it works fine, and training finishes in around five minutes.
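For reference, "training manually from the server" here means using the Rasa CLI in the project directory, roughly like this (the project path is illustrative):

```shell
# Run from the Rasa project directory on the server (path is a placeholder).
cd /path/to/rasa-project

# Train NLU and Core together; output goes to the models/ directory.
rasa train --domain domain.yml --data data/ --out models/
```

If the CLI run succeeds in ~5 minutes while the Rasa X Train button hangs, that points at the Rasa X training request handling rather than the training itself.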

I guess this is a bug. What do you think might be the problem, or has anyone else seen this? Our bot is fairly large, I suppose.

I have noticed in past versions that logging out and logging back in can sometimes help.

Also, for context: what type of EC2 machine are you running (one with multiple CPUs?), and how large is the training data?

I’m running a t2.medium: 2 vCPUs, 4 GB RAM. I don’t see how a single vs. multiple CPUs would affect the server (if you think a single CPU running at 100% during training is the cause, that would be strange, since the OS should limit that). The training data is not very large, 8 KB precisely. I also wonder how logging out and back in would help. Do you mean after the server crashes, or before? If after, I can’t do that, since I’m unable to reach the EC2 instance or the Rasa X server.

I work at Rasa (I’m a Research Advocate there), so I tend to use the bleeding-edge version of Rasa X. I recall a bug in the past (I can’t remember the version, unfortunately) where some of the UI elements would hiccup; logging out and logging back in was a remedy.

Rather than 8 KB: how many different intents/entities? How many stories? Also, could I see your model pipeline? Are you using something like pretrained BERT? 4 GB of RAM isn’t a whole lot if large models are at play.
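For anyone unsure what "model pipeline" refers to here: it is the `config.yml` in the Rasa project. A lightweight Rasa 1.x setup might look like the sketch below (illustrative only, not the poster's actual configuration; heavier components such as pretrained language models would noticeably raise memory use):

```yaml
# Illustrative Rasa 1.x config.yml (not the poster's actual pipeline)
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
```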

@gsp0din not sure if you’ve solved this since, but you say you installed Rasa X via pip? We generally don’t recommend that. Have you looked at the deployment options, e.g. with Docker Compose?

Hello! We are now trying to move to an automated deployment. I’m using the Kubernetes one-line deployment. As expected it works, but I’m new to containers and such, so I have a couple of questions:

- Is there a way to access files, for example the domain, directly?
- How do I start and stop Rasa X and Rasa, and how do I specify which ports Rasa should run on?
- Can I set RASA_X_PASSWORD directly, or is initial_user_password the only way?
- More generally, can I access Rasa as a server?
- What is the best practice for running the action server? I had my actions.py file stored in the Rasa project folder, but now I can’t access it / don’t have one. Should I install Rasa manually again and then run the action server?
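Not an official answer, but for the file-access and start/stop questions, standard `kubectl` commands cover a lot of this. A sketch (namespace, pod, deployment, and service names below are placeholders; list the real ones first and substitute them):

```shell
# List the pods the deployment created (namespace is a guess; adjust to yours).
kubectl get pods -n rasa

# Open a shell inside a pod to inspect files such as the domain
# (pod name is a placeholder; copy the real one from the listing above).
kubectl exec -it <rasa-x-pod-name> -n rasa -- /bin/bash

# Copy a file out of a pod to the local machine.
kubectl cp rasa/<rasa-x-pod-name>:/app/domain.yml ./domain.yml

# "Stop" and "start" a component by scaling its deployment down and up.
kubectl scale deployment <rasa-deployment-name> -n rasa --replicas=0
kubectl scale deployment <rasa-deployment-name> -n rasa --replicas=1

# Reach an in-cluster service on a local port instead of changing its port.
kubectl port-forward service/<rasa-x-service-name> -n rasa 8000:5002
```

The port numbers are illustrative; check the actual service ports with `kubectl get services -n rasa`.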

Maybe this will help: deploying Rasa X
It’s a Rasa Masterclass video about deploying Rasa X in a cluster environment.
