Segmentation fault (core dumped) on AWS GPU while training

I’m trying to train an intent classification model using the TensorFlow embedding pipeline in rasa_nlu. We have two datasets, with 2500 and 17000 utterances respectively.

With the 2500-utterance dataset I’m able to train the model, but with the 17000-utterance dataset training crashes with a segmentation fault:

```
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
Segmentation fault
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-01-22 11:22:03.435794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-01-22 11:22:12.477504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-22 11:22:12.477554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2019-01-22 11:22:12.477564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2019-01-22 11:22:12.481233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Epochs:   0%|          | 0/300 [00:00<?, ?it/s]
2019-01-22 11:23:03.791310: W tensorflow/core/framework/allocator.cc:108] Allocation of 17455684000 exceeds 10% of system memory.
train_models.sh: line 5:  8420 Segmentation fault      (core dumped) python -m rasa_nlu.train -o ${model_dir} -d ${tr_file} -c ${config_file} --project nlu --fixed_model_name model_1
```

However, I observed the following while training:

The 2500-utterance dataset needs about 3 GB of (CPU) RAM, but the same dataset takes 10.5 GB of the 12 GB of GPU memory.

The 17000-utterance dataset needs about 6 GB of (CPU) RAM, but the same dataset crashes with the segmentation fault on the GPU.
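
Also, the single allocation that TensorFlow warns about right before the crash is already bigger than the whole card. Rough arithmetic below, using only the numbers reported in the log (I’m not sure whether that particular allocation lands on the host or on the GPU):

```python
# Numbers taken from the log above; nothing measured separately.
alloc_bytes = 17455684000      # "Allocation of 17455684000 exceeds 10% of system memory."
gpu_total_gib = 11.17          # "totalMemory: 11.17GiB" reported for the Tesla K80

alloc_gib = alloc_bytes / 1024.0 ** 3
print("requested allocation: %.2f GiB" % alloc_gib)            # ~16.26 GiB
print("fits in GPU memory:   %s" % (alloc_gib < gpu_total_gib))  # False
```

So it looks like something in the embedding training scales with the dataset size and simply doesn’t fit for 17000 utterances, though I can’t tell what from the log alone.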

This seems very odd to me, because training uses more than three times as much GPU memory as CPU memory. Is there a reason for this behaviour?
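
My only guess for the high baseline usage is TensorFlow’s default behaviour of reserving almost all GPU memory up front. In plain TensorFlow 1.x I would limit that with a session config like the sketch below, but I don’t see an obvious way to pass such a config through `python -m rasa_nlu.train`, so this is just a guess:

```python
# Standalone TensorFlow 1.x sketch (not wired into rasa_nlu): with
# allow_growth=True the process only grabs GPU memory as it needs it,
# instead of reserving ~all of the K80's memory at session creation.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, hard-cap the fraction of GPU memory the process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

with tf.Session(config=config) as sess:
    x = tf.random_normal([1000, 1000])
    print(sess.run(tf.reduce_sum(x)))   # small op; GPU usage stays small
```

Even if that explains the 10.5 GB for the small dataset, though, I don’t think it explains the crash on the larger one.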

Is there a way to avoid the segmentation fault when training on large datasets on GPU instances?
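
The only workaround I know of so far is forcing the run onto the CPU by hiding the GPU before TensorFlow loads; something like the sketch below shows the idea (for the real run I would set the variable in the shell in front of the `python -m rasa_nlu.train ...` command). But that defeats the point of a GPU instance, so I’d prefer a proper fix.

```python
# CPU-only fallback idea: hide the GPU before TensorFlow is imported.
# For the actual training I would set CUDA_VISIBLE_DEVICES in the shell
# before the `python -m rasa_nlu.train ...` command rather than in code.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # TensorFlow will see no GPUs

from tensorflow.python.client import device_lib

# Should list only CPU devices now, so training falls back to the CPU.
print([d.name for d in device_lib.list_local_devices()])
```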