2
Lorsque j'installe tensorflow avec succès sur un cluster, je lance immédiatement une démonstration de mnist pour vérifier si ça se passe bien, mais là j'ai trouvé un problème. Je ne sais pas ce qui est tout cela au sujet, mais il semble que l'erreur provient de CUDAerreur de course tensorflow avec cublas
python3 -m tensorflow.models.image.mnist.convolutional
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:03:00.0
Total memory: 5.00GiB
Free memory: 4.92GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:03:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 715, in _do_call
return fn(*args)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn
status, run_metadata)
File "/home/gpuusr/local/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
[[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 294, in main
feed_dict=feed_dict)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
[[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'MatMul', defined at:
File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 221, in main
logits = model(train_data_node, True)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 213, in model
hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1209, in matmul
name=name)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1178, in _mat_mul
transpose_b=transpose_b, name=name)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()
Segmentation fault (core dumped)
Afin de générer ou d'exécuter TensorFlow avec la prise en charge du GPU, les outils Cuda Toolkit de NVIDIA (> = 7.0) et cuDNN (> = v2) doivent être installés. La prise en charge du GPU TensorFlow nécessite d'avoir une carte GPU avec NVidia Compute Capability> = 3.0. avez-vous suivi la configuration officcial? https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html – userfi
absolument oui, ma version cuda est 7.5 et la version cudnn est v4 –
ok, et votre carte graphique a une capacité supérieure ou égale à 3.0 ? – userfi