
I am running a Spark program on a Windows 10 machine; the Spark CSV and JSON example programs are not running on Windows.

I am trying to run the Spark program below:

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
import org.apache.spark.sql.Column 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.SQLContext 
import org.apache.spark.sql._ 
import org.apache.spark.sql.SQLImplicits 
import org.apache.spark.sql.expressions.Window 
import org.apache.spark.sql.TypedColumn 
import org.apache.spark.sql.Encoder 
import org.apache.spark.sql.Encoders 
import com.databricks.spark.csv 

object json1 {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)

    val NyseDF = sqlc.load("com.databricks.spark.csv", Map("path" -> args(0), "header" -> "true"))
    NyseDF.registerTempTable("NYSE")
    NyseDF.printSchema()
  }
}
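
For reference, the same CSV load can also be expressed through the DataFrameReader API documented for Spark 1.6 with spark-csv. The sketch below is only an equivalent form of the call above, not code from the original post; the object name is a placeholder.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: equivalent of sqlc.load("com.databricks.spark.csv", Map("path" -> ..., "header" -> "true"))
object CsvLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CsvLoadSketch").setMaster("local[2]"))
    val sqlc = new SQLContext(sc)

    val nyseDF = sqlc.read
      .format("com.databricks.spark.csv")   // spark-csv data source
      .option("header", "true")             // first line is the header
      .load(args(0))                        // CSV path passed as program argument

    nyseDF.registerTempTable("NYSE")
    nyseDF.printSchema()

    sc.stop()
  }
}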

When I run the program as an application from Eclipse, passing an argument such as

src/test/resources/demo.text

it fails with the error below.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
16/10/10 11:02:18 INFO SparkContext: Running Spark version 1.6.0 
16/10/10 11:02:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
16/10/10 11:02:18 INFO SecurityManager: Changing view acls to: subho 
16/10/10 11:02:18 INFO SecurityManager: Changing modify acls to: subho 
16/10/10 11:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho) 
16/10/10 11:02:19 INFO Utils: Successfully started service 'sparkDriver' on port 61108. 
16/10/10 11:02:20 INFO Slf4jLogger: Slf4jLogger started 
16/10/10 11:02:20 INFO Remoting: Starting remoting 
16/10/10 11:02:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:61121] 
16/10/10 11:02:20 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61121. 
16/10/10 11:02:20 INFO SparkEnv: Registering MapOutputTracker 
16/10/10 11:02:20 INFO SparkEnv: Registering BlockManagerMaster 
16/10/10 11:02:21 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-69afda02-ccd1-41d1-aa25-830ba366a75c 
16/10/10 11:02:21 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB 
16/10/10 11:02:21 INFO SparkEnv: Registering OutputCommitCoordinator 
16/10/10 11:02:21 INFO Utils: Successfully started service 'SparkUI' on port 4040. 
16/10/10 11:02:21 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040 
16/10/10 11:02:21 INFO Executor: Starting executor ID driver on host localhost 
16/10/10 11:02:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61132. 
16/10/10 11:02:21 INFO NettyBlockTransferService: Server created on 61132 
16/10/10 11:02:21 INFO BlockManagerMaster: Trying to register BlockManager 
16/10/10 11:02:21 INFO BlockManagerMasterEndpoint: Registering block manager localhost:61132 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 61132) 
16/10/10 11:02:21 INFO BlockManagerMaster: Registered BlockManager 
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 107.7 KB) 
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.8 KB, free 117.5 KB) 
16/10/10 11:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61132 (size: 9.8 KB, free: 1128.4 MB) 
16/10/10 11:02:23 INFO SparkContext: Created broadcast 0 from textFile at TextFile.scala:30 
16/10/10 11:02:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path 
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278) 
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300) 
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293) 
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76) 
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) 
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015) 
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015) 
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) 
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) 
    at scala.Option.map(Option.scala:146) 
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.take(RDD.scala:1288) 
    at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174) 
    at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169) 
    at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147) 
    at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70) 
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138) 
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40) 
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28) 
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) 
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153) 
    at json1$.main(json1.scala:22) 
    at json1.main(json1.scala) 
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.take(RDD.scala:1288) 
    at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174) 
    at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169) 
    at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147) 
    at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70) 
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138) 
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40) 
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28) 
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) 
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153) 
    at json1$.main(json1.scala:22) 
    at json1.main(json1.scala) 
16/10/10 11:02:23 INFO SparkContext: Invoking stop() from shutdown hook 
16/10/10 11:02:23 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040 
16/10/10 11:02:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
16/10/10 11:02:23 INFO MemoryStore: MemoryStore cleared 
16/10/10 11:02:23 INFO BlockManager: BlockManager stopped 
16/10/10 11:02:24 INFO BlockManagerMaster: BlockManagerMaster stopped 
16/10/10 11:02:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
16/10/10 11:02:24 INFO SparkContext: Successfully stopped SparkContext 
16/10/10 11:02:24 INFO ShutdownHookManager: Shutdown hook called 
16/10/10 11:02:24 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-7f53ea20-a38c-46d5-8476-a1ae040736ac 

Below is the main error message:

Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text

I have the file at the location shown below.

[screenshot of the file location in the project]
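
A quick way to see where that relative argument actually points is to resolve it against the JVM working directory before Spark touches it. This is a minimal sketch, assuming Eclipse launches the program with the project root as the working directory; the object name is a placeholder.

import java.io.File

object PathCheck {
  def main(args: Array[String]): Unit = {
    // The same relative path that is passed to the Spark job.
    val f = new File("src/test/resources/demo.text")
    // Prints the absolute path Spark will see and whether the file really exists there.
    println(s"resolves to ${f.getAbsolutePath}, exists = ${f.exists()}")
  }
}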

When I ran the program below, it ran successfully:

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
import org.apache.spark.sql.Column 
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.SQLContext 
import org.apache.spark.sql._ 
import org.apache.spark.sql.SQLImplicits 
import org.apache.spark.sql.expressions.Window 
import org.apache.spark.sql.TypedColumn 
import org.apache.spark.sql.Encoder 
import org.apache.spark.sql.Encoders 
import com.databricks.spark.csv 

object json1 {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)

    /* val NyseDF = sqlc.load("com.databricks.spark.csv", Map("path" -> args(0), "header" -> "true"))
    NyseDF.registerTempTable("NYSE")
    NyseDF.printSchema()

    print(sqlc.sql("select distinct(symbol) from NYSE").collect().toList) */

    val PersonDF = sqlc.jsonFile("src/test/resources/Person.json")

    // PersonDF.printSchema()

    PersonDF.registerTempTable("Person")

    sqlc.sql("select * from Person where age < 60").collect().foreach(print)
  }
}
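
For comparison, sqlc.jsonFile is deprecated in Spark 1.6 in favor of the DataFrameReader. Below is a minimal sketch of the same JSON read in that form; the object name is a placeholder and it assumes the same Person.json fixture as above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonReadSketch").setMaster("local[2]"))
    val sqlc = new SQLContext(sc)

    // read.json is the non-deprecated counterpart of sqlc.jsonFile.
    val personDF = sqlc.read.json("src/test/resources/Person.json")
    personDF.registerTempTable("Person")
    sqlc.sql("select * from Person where age < 60").collect().foreach(println)

    sc.stop()
  }
}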

Below is the log file:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
16/10/10 11:54:12 INFO SparkContext: Running Spark version 1.6.0 
16/10/10 11:54:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
16/10/10 11:54:13 INFO SecurityManager: Changing view acls to: subho 
16/10/10 11:54:13 INFO SecurityManager: Changing modify acls to: subho 
16/10/10 11:54:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho) 
16/10/10 11:54:14 INFO Utils: Successfully started service 'sparkDriver' on port 51113. 
16/10/10 11:54:14 INFO Slf4jLogger: Slf4jLogger started 
16/10/10 11:54:14 INFO Remoting: Starting remoting 
16/10/10 11:54:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:51126] 
16/10/10 11:54:15 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 51126. 
16/10/10 11:54:15 INFO SparkEnv: Registering MapOutputTracker 
16/10/10 11:54:15 INFO SparkEnv: Registering BlockManagerMaster 
16/10/10 11:54:15 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-a52a5d5a-075b-4859-8434-935fdaba8538 
16/10/10 11:54:15 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB 
16/10/10 11:54:15 INFO SparkEnv: Registering OutputCommitCoordinator 
16/10/10 11:54:15 INFO Utils: Successfully started service 'SparkUI' on port 4040. 
16/10/10 11:54:15 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040 
16/10/10 11:54:15 INFO Executor: Starting executor ID driver on host localhost 
16/10/10 11:54:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51137. 
16/10/10 11:54:15 INFO NettyBlockTransferService: Server created on 51137 
16/10/10 11:54:15 INFO BlockManagerMaster: Trying to register BlockManager 
16/10/10 11:54:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:51137 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 51137) 
16/10/10 11:54:15 INFO BlockManagerMaster: Registered BlockManager 
16/10/10 11:54:17 INFO JSONRelation: Listing file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json on driver 
16/10/10 11:54:17 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path 
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278) 
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300) 
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293) 
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76) 
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108) 
    at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636) 
    at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635) 
    at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109) 
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244) 
    at org.apache.spark.sql.SQLContext.jsonFile(SQLContext.scala:1011) 
    at json1$.main(json1.scala:28) 
    at json1.main(json1.scala) 
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 128.0 KB, free 128.0 KB) 
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.1 KB, free 142.1 KB) 
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.4 MB) 
16/10/10 11:54:18 INFO SparkContext: Created broadcast 0 from jsonFile at json1.scala:28 
16/10/10 11:54:18 INFO FileInputFormat: Total input paths to process : 1 
16/10/10 11:54:18 INFO SparkContext: Starting job: jsonFile at json1.scala:28 
16/10/10 11:54:18 INFO DAGScheduler: Got job 0 (jsonFile at json1.scala:28) with 2 output partitions 
16/10/10 11:54:18 INFO DAGScheduler: Final stage: ResultStage 0 (jsonFile at json1.scala:28) 
16/10/10 11:54:18 INFO DAGScheduler: Parents of final stage: List() 
16/10/10 11:54:18 INFO DAGScheduler: Missing parents: List() 
16/10/10 11:54:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28), which has no missing parents 
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 146.3 KB) 
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 148.6 KB) 
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:51137 (size: 2.4 KB, free: 1128.4 MB) 
16/10/10 11:54:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006 
16/10/10 11:54:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28) 
16/10/10 11:54:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 
16/10/10 11:54:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2113 bytes) 
16/10/10 11:54:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2113 bytes) 
16/10/10 11:54:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 
16/10/10 11:54:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92 
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93 
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 
16/10/10 11:54:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 
16/10/10 11:54:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 
16/10/10 11:54:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 
16/10/10 11:54:18 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 
16/10/10 11:54:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2886 bytes result sent to driver 
16/10/10 11:54:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2886 bytes result sent to driver 
16/10/10 11:54:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1287 ms on localhost (1/2) 
16/10/10 11:54:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1264 ms on localhost (2/2) 
16/10/10 11:54:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/10/10 11:54:19 INFO DAGScheduler: ResultStage 0 (jsonFile at json1.scala:28) finished in 1.314 s 
16/10/10 11:54:19 INFO DAGScheduler: Job 0 finished: jsonFile at json1.scala:28, took 1.413653 s 
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:51137 in memory (size: 2.4 KB, free: 1128.4 MB) 
16/10/10 11:54:20 INFO ContextCleaner: Cleaned accumulator 1 
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:51137 in memory (size: 14.1 KB, free: 1128.4 MB) 
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 59.6 KB, free 59.6 KB) 
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 13.8 KB, free 73.3 KB) 
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51137 (size: 13.8 KB, free: 1128.4 MB) 
16/10/10 11:54:21 INFO SparkContext: Created broadcast 2 from collect at json1.scala:34 
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 128.0 KB, free 201.3 KB) 
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 14.1 KB, free 215.4 KB) 
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.3 MB) 
16/10/10 11:54:21 INFO SparkContext: Created broadcast 3 from collect at json1.scala:34 
16/10/10 11:54:21 INFO FileInputFormat: Total input paths to process : 1 
16/10/10 11:54:21 INFO SparkContext: Starting job: collect at json1.scala:34 
16/10/10 11:54:21 INFO DAGScheduler: Got job 1 (collect at json1.scala:34) with 2 output partitions 
16/10/10 11:54:21 INFO DAGScheduler: Final stage: ResultStage 1 (collect at json1.scala:34) 
16/10/10 11:54:21 INFO DAGScheduler: Parents of final stage: List() 
16/10/10 11:54:21 INFO DAGScheduler: Missing parents: List() 
16/10/10 11:54:21 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34), which has no missing parents 
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 7.6 KB, free 223.0 KB) 
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.1 KB, free 227.1 KB) 
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:51137 (size: 4.1 KB, free: 1128.3 MB) 
16/10/10 11:54:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006 
16/10/10 11:54:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34) 
16/10/10 11:54:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks 
16/10/10 11:54:21 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2113 bytes) 
16/10/10 11:54:21 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2113 bytes) 
16/10/10 11:54:21 INFO Executor: Running task 0.0 in stage 1.0 (TID 2) 
16/10/10 11:54:21 INFO Executor: Running task 1.0 in stage 1.0 (TID 3) 
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93 
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92 
16/10/10 11:54:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:51137 in memory (size: 13.8 KB, free: 1128.4 MB) 
16/10/10 11:54:22 INFO GenerateUnsafeProjection: Code generated in 548.352258 ms 
16/10/10 11:54:22 INFO GeneratePredicate: Code generated in 5.245214 ms 
16/10/10 11:54:22 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2283 bytes result sent to driver 
16/10/10 11:54:22 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2536 bytes result sent to driver 
16/10/10 11:54:22 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 755 ms on localhost (1/2) 
16/10/10 11:54:22 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 759 ms on localhost (2/2) 
16/10/10 11:54:22 INFO DAGScheduler: ResultStage 1 (collect at json1.scala:34) finished in 0.760 s 
16/10/10 11:54:22 INFO DAGScheduler: Job 1 finished: collect at json1.scala:34, took 0.779652 s 
16/10/10 11:54:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
[53,Barack,Obama]16/10/10 11:54:22 INFO SparkContext: Invoking stop() from shutdown hook 
16/10/10 11:54:22 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040 
16/10/10 11:54:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
16/10/10 11:54:22 INFO MemoryStore: MemoryStore cleared 
16/10/10 11:54:22 INFO BlockManager: BlockManager stopped 
16/10/10 11:54:22 INFO BlockManagerMaster: BlockManagerMaster stopped 
16/10/10 11:54:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
16/10/10 11:54:22 INFO SparkContext: Successfully stopped SparkContext 
16/10/10 11:54:22 INFO ShutdownHookManager: Shutdown hook called 
16/10/10 11:54:22 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-6cab6329-83f1-4af4-b64c-c869550405a4 
16/10/10 11:54:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 

Thanks and regards,


check this [stackoverflow answer](http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path) – VladoDemcak

Answer


The important part of the stack trace is here:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path 
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 

One possibility is to download winutils.exe (for example from here), put it in a folder called bin (as a subdirectory of your home directory, e.g. C:\Users\XXX\bin\winutils.exe), and then add this line at the beginning of your code:

System.setProperty("hadoop.home.dir", raw"C:\Users\XXX\") 
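
A minimal sketch of how that line fits into the program, assuming winutils.exe was placed under C:\Users\XXX\bin (both the object name and the path are placeholders; pointing the HADOOP_HOME environment variable at the same directory is a common alternative):

import org.apache.spark.{SparkConf, SparkContext}

object WinutilsFix {
  def main(args: Array[String]): Unit = {
    // Must run before any Spark/Hadoop class is initialized; the directory must contain bin\winutils.exe.
    System.setProperty("hadoop.home.dir", "C:\\Users\\XXX\\")

    val sc = new SparkContext(new SparkConf().setAppName("WinutilsFix").setMaster("local[2]"))
    // ... rest of the job ...
    sc.stop()
  }
}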

Thanks Raphael, it is working fine now. Can you please explain what the problem was here? – subho