2016-07-11
2

TensorFlow runtime error with cuBLAS

After successfully installing TensorFlow on a cluster, I immediately ran the MNIST demo to check that everything works, but I ran into a problem. I don't know what this is all about, but the error seems to come from CUDA:

python3 -m tensorflow.models.image.mnist.convolutional 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
Extracting data/train-images-idx3-ubyte.gz 
Extracting data/train-labels-idx1-ubyte.gz 
Extracting data/t10k-images-idx3-ubyte.gz 
Extracting data/t10k-labels-idx1-ubyte.gz 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K20m 
major: 3 minor: 5 memoryClockRate (GHz) 0.7055 
pciBusID 0000:03:00.0 
Total memory: 5.00GiB 
Free memory: 4.92GiB 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:03:00.0) 
Initialized! 
E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED 
Traceback (most recent call last): 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 715, in _do_call 
return fn(*args) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn 
status, run_metadata) 
    File "/home/gpuusr/local/lib/python3.5/contextlib.py", line 66, in __exit__ 
next(self.gen) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status 
pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136 
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]] 
[[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main 
"__main__", mod_spec) 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code 
exec(code, run_globals) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module> 
tf.app.run() 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run 
sys.exit(main(sys.argv)) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 294, in main 
feed_dict=feed_dict) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 372, in run 
run_metadata_ptr) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 636, in _run 
feed_dict_string, options, run_metadata) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 708, in _do_run 
target_list, options, run_metadata) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 728, in _do_call 
raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136 
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]] 
[[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 
Caused by op 'MatMul', defined at: 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main 
"__main__", mod_spec) 
    File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code 
exec(code, run_globals) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module> 
tf.app.run() 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run 
sys.exit(main(sys.argv)) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 221, in main 
logits = model(train_data_node, True) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 213, in model 
hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1209, in matmul 
name=name) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1178, in _mat_mul 
transpose_b=transpose_b, name=name) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op 
op_def=op_def) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op 
original_op=self._default_original_op, op_def=op_def) 
    File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__ 
self._traceback = _extract_stack() 

Segmentation fault (core dumped) 
+0

To build or run TensorFlow with GPU support, both NVIDIA's CUDA Toolkit (>= 7.0) and cuDNN (>= v2) must be installed. TensorFlow GPU support also requires a GPU card with NVIDIA Compute Capability >= 3.0. Did you follow the official setup? https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html – userfi

+0

Absolutely yes, my CUDA version is 7.5 and my cuDNN version is v4 –

+0

OK, and does your graphics card have a compute capability of 3.0 or higher? – userfi

Answer

1

I had exactly the same error because, in my LD_LIBRARY_PATH, CUDA 5.5 came before CUDA 7.5. After moving 7.5 ahead of 5.5, everything works fine now.
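
For anyone who wants to check the ordering described in this answer, here is a minimal sketch in Python. It only inspects and reorders the string; the helper name `prioritize_cuda` and the `cuda-7.5` directory fragment are assumptions made for illustration, so adjust them to wherever your toolkits are actually installed.

```python
import os


def prioritize_cuda(path_value, preferred, sep=":"):
    """Return path_value with entries containing `preferred` moved to the front.

    Hypothetical helper: the relative order inside both groups is preserved
    and empty entries are dropped.
    """
    entries = [p for p in path_value.split(sep) if p]
    first = [p for p in entries if preferred in p]
    rest = [p for p in entries if preferred not in p]
    return sep.join(first + rest)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    # "cuda-7.5" is an assumed directory name fragment, not a fixed convention.
    print("current :", current)
    print("proposed:", prioritize_cuda(current, "cuda-7.5"))
```

The dynamic loader normally reads LD_LIBRARY_PATH when a process starts, so the proposed value would have to be exported in the shell before launching python3; changing it from inside an already running interpreter is unlikely to affect which libcublas.so gets loaded.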