Je tente d'exécuter un script pyspark sur BigInsights on Cloud 4.2 Enterprise qui accède à une table Hive.Le travail de cluster spark fils-cluster échoue avec: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
D'abord, je crée la table de ruche:
[[email protected] ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Puis-je créer un simple script pyspark:
[[email protected] ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print(pokesRdd.collect())
je tente d'exécuter avec:
spark-submit --master yarn-cluster test_pokes.py
Cependant, je rencontre l'erreur:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
J'ai vu un certain nombre de messages similaires pour d'autres distributions Hadoop, mais pas pour BigInsights on Cloud.
La table fait simplement n'existe pas, vous pourriez manquer la base de données. Par exemple, 'web.pokes' ou quelque chose. Vous pourriez vouloir démarrer un shell 'spark-sql' et interroger' SHOW DATABASES'. –
Un comportement étrange se passe. Il n'existe que lorsque l'étincelle fonctionne sur un cluster de fils - voir http://stackoverflow.com/questions/41263670/spark-hive-reporting-pyspark-sql-utils-analysisexception-out-not-found-xxx –