2017-10-18 16 views

TensorFlow crash in mnist_deep.py

I am currently running training with mnist_deep.py from the TensorFlow tutorial on a GeForce GTX 1080 (8 GB), with 16 GB of RAM on the machine. All the latest CUDA libraries and drivers are installed, and everything runs on TensorFlow 1.3. The mnist_deep.py script ran correctly, without any errors, until I decided to train a Keras VDSR model (https://github.com/jackie840129/VDSR-reduction_with-Keras). That training hung and the GPU was lost (no access via nvidia-smi). After rebooting, I tried to run mnist_deep.py again and now get the errors below every time. I still don't know what could be causing the problem. Rebooting and reinstalling CUDA do not seem to fix it. Reimaging the machine does seem to fix it, but that is not practical. Any ideas on what could be causing the problem in the first place, and how to fix it for good?

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz 
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz 
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz 
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz 
Saving graph to: /tmp/tmpgb1l75z_ 
2017-10-18 15:36:28.098787: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use SSE4.1 instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098807: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use SSE4.2 instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098814: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use AVX instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098820: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use AVX2 instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.098825: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use FMA instructions, but these are 
available on your machine and could speed up CPU computations. 
2017-10-18 15:36:28.760202: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful 
NUMA node read from SysFS had negative value (-1), but there must be 
at least one NUMA node, so returning NUMA node zero 
2017-10-18 15:36:28.760643: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 
with properties: 
name: GeForce GTX 1080 
major: 6 minor: 1 memoryClockRate (GHz) 1.7715 
pciBusID 0000:01:00.0 
Total memory: 7.92GiB 
Free memory: 7.81GiB 
2017-10-18 15:36:28.760657: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-18 15:36:28.760664: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 
2017-10-18 15:36:28.760672: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating 
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci 
bus id: 0000:01:00.0) 
2017-10-18 15:36:31.546892: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get 
elapsed time between events: CUDA_ERROR_NOT_READY 
2017-10-18 15:36:32.547035: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get 
elapsed time between events: CUDA_ERROR_NOT_READY 
2017-10-18 15:36:32.549299: E 
tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create 
cublas handle: CUBLAS_STATUS_NOT_INITIALIZED 
2017-10-18 15:36:32.549317: W 
tensorflow/stream_executor/stream.cc:1756] attempting to perform BLAS 
operation using StreamExecutor without BLAS support 
Traceback (most recent call last): 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1327, in _do_call 
return fn(*args) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1306, in _run_fn 
status, run_metadata) 
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__ 
next(self.gen) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/errors_impl.py", line 466, in 
raise_exception_on_not_ok_status 
pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM 
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50, 
n=1024, k=3136 
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"] 
(fc1/Reshape, fc1/Variable/read)]] 
[[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"] 
()]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "mnist_deep.py", line 178, in <module> 
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/platform/app.py", line 48, in run 
_sys.exit(main(_sys.argv[:1] + flags_passthrough)) 
File "mnist_deep.py", line 165, in main 
x: batch[0], y_: batch[1], keep_prob: 1.0}) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 541, in eval 
return _eval_using_default_session(self, feed_dict, self.graph, 
session) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 4085, in 
_eval_using_default_session 
return session.run(tensors, feed_dict) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 895, in run 
run_metadata_ptr) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1124, in _run 
feed_dict_tensor, options, run_metadata) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1321, in _do_run 
options, run_metadata) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/client/session.py", line 1340, in _do_call 
raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM 
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50, 
n=1024, k=3136 
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"] 
(fc1/Reshape, fc1/Variable/read)]] 
[[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"] 
()]] 

Caused by op 'fc1/MatMul', defined at: 
File "mnist_deep.py", line 178, in <module> 
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/platform/app.py", line 48, in run 
_sys.exit(main(_sys.argv[:1] + flags_passthrough)) 
File "mnist_deep.py", line 134, in main 
y_conv, keep_prob = deepnn(x) 
File "mnist_deep.py", line 83, in deepnn 
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul 
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in 
_mat_mul 
transpose_b=transpose_b, name=name) 
File "/home/nmh/env/lib/python3.6/site 
/tensorflow/python/framework/op_def_library.py", line 767, in apply_op 
op_def=op_def) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 2630, in create_op 
original_op=self._default_original_op, op_def=op_def) 
File "/home/nmh/env/lib/python3.6/site- 
packages/tensorflow/python/framework/ops.py", line 1204, in __init__ 
self._traceback = self._graph._extract_stack() # pylint: 
disable=protected-access 

InternalError (see above for traceback): Blas GEMM launch failed : 
a.shape=(50, 3136), b.shape=(3136, 1024), m=50, n=1024, k=3136 
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"] 
(fc1/Reshape, fc1/Variable/read)]] 
[[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"] 
()]] 
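
For a sense of scale, the operand shapes printed in the error can be turned into a rough byte count. The sketch below is only arithmetic on the values in the log (`a.shape=(50, 3136)`, `b.shape=(3136, 1024)`, `T=DT_FLOAT`); the helper names `tensor_bytes` and `gemm_bytes` are made up for illustration, and the 4-byte item size assumes `DT_FLOAT` is 32-bit. It says nothing about how much GPU memory TensorFlow or any other process had already reserved when the call failed.

```python
from functools import reduce
from operator import mul

# Assumption: DT_FLOAT in the log is a 32-bit float, i.e. 4 bytes per element.
FLOAT32_BYTES = 4


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed to store one dense tensor of the given shape."""
    return reduce(mul, shape, 1) * itemsize


def gemm_bytes(m, n, k, itemsize=FLOAT32_BYTES):
    """Bytes for both inputs (m x k and k x n) plus the output (m x n)."""
    return (tensor_bytes((m, k), itemsize)
            + tensor_bytes((k, n), itemsize)
            + tensor_bytes((m, n), itemsize))


# m, n, k exactly as printed in "Blas GEMM launch failed".
total = gemm_bytes(m=50, n=1024, k=3136)
print("%d bytes (%.2f MiB)" % (total, total / 2.0 ** 20))
# prints: 13677056 bytes (13.04 MiB)
```

Against the 7.81 GiB reported as free in the log, the tensors of this one matrix multiplication are small, so the size of this particular operation alone does not look like the limiting factor.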

Answer

I had the same error once. It was caused by an out-of-memory error: the OS killed my training process because of the amount of RAM it was using, and since that is a fairly violent way to stop, I also lost contact with the GPU. A few reboots, plus removing the GPU and putting it back in, got it working again. You can look at this question to see whether your problem is the same. If it is, you will probably need to use a smaller network.
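
Since the symptom in both cases was losing the GPU (no access via nvidia-smi), it may help to check that the card still answers before starting another run. The following is a minimal sketch using only the Python standard library; the function name `gpu_responds` and the 10-second timeout are arbitrary choices for illustration, and it assumes `nvidia-smi` is on the `PATH`. It only tells you whether the tool returns successfully, not why the GPU was lost.

```python
import shutil
import subprocess


def gpu_responds(command="nvidia-smi", timeout_s=10):
    """Return True if `command` is found, finishes in time, and exits with 0."""
    if shutil.which(command) is None:
        return False  # the tool is not on PATH
    try:
        result = subprocess.run(
            [command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # the call hung or could not be started
    return result.returncode == 0


print("GPU reachable:", gpu_responds())
```

If this reports `False` right after a reboot, the problem is probably below TensorFlow (driver or hardware) rather than in the training script. One caveat: if the driver is badly wedged, the call may still block despite the timeout.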