TensorFlow crash in mnist_deep.py

I am running training with mnist_deep.py from the TensorFlow tutorial on a GeForce GTX 1080 (8 GB) in a machine with 16 GB of RAM. The latest CUDA libraries and drivers are installed, and everything runs on TensorFlow 1.3. The mnist_deep.py script ran without any errors until I decided to train a Keras VDSR model (https://github.com/jackie840129/VDSR-reduction_with-Keras). That training run hung and the GPU was lost (no access via nvidia-smi). After rebooting, I tried to run mnist_deep.py again and now consistently get the errors below. I still don't know what could be causing the problem. Rebooting and reinstalling CUDA do not seem to fix it. Reimaging the machine does seem to fix it, but that is not practical. Any ideas on what could be causing this in the first place, and how to fix it for good?

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/tmpgb1l75z_
2017-10-18 15:36:28.098787: W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE4.1 instructions, but these are
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098807: W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE4.2 instructions, but these are
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098814: W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use AVX instructions, but these are
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098820: W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use AVX2 instructions, but these are
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098825: W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use FMA instructions, but these are
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.760202: I
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful
NUMA node read from SysFS had negative value (-1), but there must be
at least one NUMA node, so returning NUMA node zero
2017-10-18 15:36:28.760643: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0
with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-10-18 15:36:28.760657: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-18 15:36:28.760664: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-10-18 15:36:28.760672: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci
bus id: 0000:01:00.0)
2017-10-18 15:36:31.546892: E
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get
elapsed time between events: CUDA_ERROR_NOT_READY
2017-10-18 15:36:32.547035: E
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get
elapsed time between events: CUDA_ERROR_NOT_READY
2017-10-18 15:36:32.549299: E
tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create
cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-10-18 15:36:32.549317: W
tensorflow/stream_executor/stream.cc:1756] attempting to perform BLAS
operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/errors_impl.py", line 466, in
raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50,
n=1024, k=3136
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false,
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"]
(fc1/Reshape, fc1/Variable/read)]]
[[Node: Mean_1/_7 = _Recv[client_terminated=false,
recv_device="/job:localhost/replica:0/task:0/cpu:0",
send_device="/job:localhost/replica:0/task:0/gpu:0",
send_device_incarnation=1, tensor_name="edge_79_Mean_1",
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]
()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist_deep.py", line 178, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "mnist_deep.py", line 165, in main
x: batch[0], y_: batch[1], keep_prob: 1.0})
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 541, in eval
return _eval_using_default_session(self, feed_dict, self.graph,
session)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 4085, in
_eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50,
n=1024, k=3136
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false,
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"]
(fc1/Reshape, fc1/Variable/read)]]
[[Node: Mean_1/_7 = _Recv[client_terminated=false,
recv_device="/job:localhost/replica:0/task:0/cpu:0",
send_device="/job:localhost/replica:0/task:0/gpu:0",
send_device_incarnation=1, tensor_name="edge_79_Mean_1",
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]
()]]
Caused by op 'fc1/MatMul', defined at:
File "mnist_deep.py", line 178, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "mnist_deep.py", line 134, in main
y_conv, keep_prob = deepnn(x)
File "mnist_deep.py", line 83, in deepnn
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in
_mat_mul
transpose_b=transpose_b, name=name)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint:
disable=protected-access
InternalError (see above for traceback): Blas GEMM launch failed :
a.shape=(50, 3136), b.shape=(3136, 1024), m=50, n=1024, k=3136
[[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false,
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"]
(fc1/Reshape, fc1/Variable/read)]]
[[Node: Mean_1/_7 = _Recv[client_terminated=false,
recv_device="/job:localhost/replica:0/task:0/cpu:0",
send_device="/job:localhost/replica:0/task:0/gpu:0",
send_device_incarnation=1, tensor_name="edge_79_Mean_1",
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]
()]]
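
To make the long output above easier to scan, here is a minimal sketch (not part of the tutorial script) that pulls only the error-level records out of a wrapped log like this one. The function name `error_records` is hypothetical, and the parsing assumes each record starts with a timestamp line ending in a severity letter (I, W, E or F), as the log appears above. Other TensorFlow builds may format their logs differently.

```python
import re

# Header line of a wrapped log record, e.g. "2017-10-18 15:36:32.549299: E".
# Assumption: the severity letter sits at the end of the timestamp line,
# which is how the log is wrapped in this post.
HEADER = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF])\s*$")


def error_records(log_text):
    """Return the message text of every E/F-level record, in log order."""
    records = []      # list of (severity, [continuation lines])
    current = None    # record currently being collected, if any
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = (match.group(1), [])
            records.append(current)
        elif line.startswith("Traceback"):
            current = None  # the Python traceback is not part of a log record
        elif current is not None:
            current[1].append(line.strip())
    return [" ".join(parts) for severity, parts in records if severity in "EF"]


# A short excerpt of the log above, used as sample input.
sample = """\
2017-10-18 15:36:28.760657: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-18 15:36:31.546892: E
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get
elapsed time between events: CUDA_ERROR_NOT_READY
2017-10-18 15:36:32.549299: E
tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create
cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
Traceback (most recent call last):
return fn(*args)
"""

for message in error_records(sample):
    print(message)
```

In the full log above, the E-level records are the two CUDA_ERROR_NOT_READY entries and the failed cuBLAS handle creation, all logged before the Blas GEMM launch failure is raised in the traceback.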