Je travaille sur MXNet, une bibliothèque d'apprentissage en profondeur. Ma structure implémentée est à la fois sur des machines à une seule machine et des machines à processeurs distribués. J'ai suivi le tutorial sur le site officiel de MXNet. L'implémentation sur une seule machine fonctionnait sans aucun problème et j'ai obtenu le résultat.Autorisation refusée dans une formation Mxnet distribuée
Puis j'ai essayé une formation distribuée avec plusieurs machines CPU. J'ai créé un compte sur AWS, machine virtuelle amazon, et lancé 3 de t2.micro ubuntu. Je la ligne typée de commande suivante:
../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync
Cette ligne de commande ci-dessus suppose que pour exécuter une formation de version distribuée avec 2 travailleurs et 1 serveur.
Malheureusement, j'ai eu une erreur. Je crois comprendre qu'il ya une permission refusée, mais j'ai essayé d'accéder à deux autres travailleurs du serveur par la commande suivante et il fonctionne:
ssh -i key.pem [email protected] number.
Voici l'erreur
Permission denied (publickey).
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_WORKER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
Permission denied (publickey).
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
[03:48:39] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Traceback (most recent call last):
File "train_mnist.py", line 76, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/home/ubuntu/Research/code/mxnet/example/image-classification/common/fit.py", line 97, in fit
kv = mx.kvstore.create(args.kv_store)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/kvstore.py", line 403, in create
ctypes.byref(handle)))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/base.py", line 77, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 363, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1