I keep running into strange errors when I use different values for the layers[] parameter of MultilayerPerceptronClassifier in Spark.

For example, with the same data:

int[] layers = {100, 98, 2};
new MultilayerPerceptronClassifier().setLayers(layers).setLabelCol(targetColumn).fit(data);

I get: java.lang.ArrayIndexOutOfBoundsException

With stack trace: 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) 
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) 
     at scala.Option.foreach(Option.scala:257) 
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) 
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930) 
     at org.apache.spark.rdd.RDD.count(RDD.scala:1134) 
     at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:195) 
     at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:142) 
     at org.apache.spark.ml.ann.FeedForwardTrainer.train(Layer.scala:819) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassifier.train(MultilayerPerceptronClassifier.scala:262) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassifier.train(MultilayerPerceptronClassifier.scala:147) 

Now I switch to

int[] layers = {10, 8, 2};

and everything seems to work. My next attempt is:

int[] layers = {9, 6, 2};

and I get output that looks much stranger:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double) 
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
     at org.apache.spark.scheduler.Task.run(Task.scala:86) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Caused by: java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch! 
     at scala.Predef$.require(Predef.scala:224) 
     at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41) 
     at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:164) 
     at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:483) 
     at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:530) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:322) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:296) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:187) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:186) 
     ... 16 more 
17/02/08 12:55:34 WARN TaskSetManager: Lost task 0.0 in stage 68.0 (TID 68, localhost): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double) 
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
     at org.apache.spark.scheduler.Task.run(Task.scala:86) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Caused by: java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch! 
     at scala.Predef$.require(Predef.scala:224) 
     at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41) 
     at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:164) 
     at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:483) 
     at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:530) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:322) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:296) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:187) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:186) 
     ... 16 more 

17/02/08 12:55:34 ERROR TaskSetManager: Task 0 in stage 68.0 failed 1 times; aborting job 
17/02/08 12:55:34 INFO TaskSchedulerImpl: Removed TaskSet 68.0, whose tasks have all completed, from pool 
17/02/08 12:55:34 INFO TaskSchedulerImpl: Cancelling stage 68 
17/02/08 12:55:34 INFO DAGScheduler: ResultStage 68 (show at DataPipeline.java:213) failed in 0,910 s 
17/02/08 12:55:34 INFO DAGScheduler: Job 67 failed: show at DataPipeline.java:213, took 0,914385 s 
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 68.0 failed 1 times, most recent failure: Lost task 0.0 in stage 68.0 (TID 68, localhost): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double) 
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
     at org.apache.spark.scheduler.Task.run(Task.scala:86) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Caused by: java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch! 
     at scala.Predef$.require(Predef.scala:224) 
     at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41) 
     at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:164) 
     at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:483) 
     at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:530) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:322) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:296) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:187) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:186) 
     ... 16 more 

Driver stacktrace: 
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) 
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) 
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) 
     at scala.Option.foreach(Option.scala:257) 
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) 
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) 
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916) 
     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) 
     at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) 
     at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2193) 
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) 
     at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) 
     at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192) 
     at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199) 
     at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1935) 
     at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1934) 
     at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2576) 
     at org.apache.spark.sql.Dataset.head(Dataset.scala:1934) 
     at org.apache.spark.sql.Dataset.take(Dataset.scala:2149) 
     at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) 
     at org.apache.spark.sql.Dataset.show(Dataset.scala:526) 
     at org.apache.spark.sql.Dataset.show(Dataset.scala:486) 
     at org.apache.spark.sql.Dataset.show(Dataset.scala:495) 
     at org.sparkexample.DataPipeline.trainNeuralNetwork(DataPipeline.java:213) 
     at org.sparkexample.DataPipeline.selectModel(DataPipeline.java:184) 
     at org.sparkexample.DataPipeline.main(DataPipeline.java:131) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:498) 
     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736) 
     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) 
     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) 
     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) 
     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double) 
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) 
     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
     at org.apache.spark.scheduler.Task.run(Task.scala:86) 
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Caused by: java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch! 
     at scala.Predef$.require(Predef.scala:224) 
     at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41) 
     at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:164) 
     at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:483) 
     at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:530) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:322) 
     at org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel.predict(MultilayerPerceptronClassifier.scala:296) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:187) 
     at org.apache.spark.ml.PredictionModel$$anonfun$1.apply(Predictor.scala:186) 
     ... 16 more 

So what exactly should I pass to layers? From the docs I understand that the last element is the number of classes, and the rest is an arbitrary array of neuron counts.

The actual number of features I have, passed as a single feature vector, is 9.

Answer

I found experimentally that the required number of neurons for the input layer is

numFeatures + 1

so my guess is that the +1 is because of predictionCol.

Strange, since "Prepare data for MultilayerPerceptronClassifier in scala" recommends only numFeatures neurons.
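
One way to check this guess is to measure the length of an actual feature vector and use that as the first entry of layers, instead of counting features by hand. Below is a minimal sketch in plain Java, assuming the first entry of layers should equal the feature vector length and the last entry the number of classes (that is how the Spark docs describe the parameter). LayerSizes, build and inputMatches are hypothetical helper names, and the double[] stands in for the values of one row of the features column; none of this is Spark API.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: derives and checks a layers array.
final class LayerSizes {

    // Builds {inputSize, hidden..., numClasses}.
    // Assumption: inputSize is the length of the assembled feature vector.
    public static int[] build(int inputSize, int[] hidden, int numClasses) {
        if (inputSize < 1 || numClasses < 2) {
            throw new IllegalArgumentException(
                "need inputSize >= 1 and numClasses >= 2");
        }
        int[] layers = new int[hidden.length + 2];
        layers[0] = inputSize;
        System.arraycopy(hidden, 0, layers, 1, hidden.length);
        layers[layers.length - 1] = numClasses;
        return layers;
    }

    // True when the first layer size equals the length of a sample feature row.
    public static boolean inputMatches(int[] layers, double[] sampleFeatures) {
        return layers.length >= 2 && layers[0] == sampleFeatures.length;
    }

    public static void main(String[] args) {
        // Hypothetical row holding 10 values, one more than the 9 expected.
        double[] sample = new double[10];

        int[] layers = build(sample.length, new int[] {6}, 2);
        System.out.println(Arrays.toString(layers));

        // Reports whether {9, 6, 2} fits a row of this length.
        System.out.println(inputMatches(new int[] {9, 6, 2}, sample));
    }
}
```

If the measured length turns out to be numFeatures + 1, the extra element is already in the feature vector itself, which would explain why {10, 8, 2} works and {9, 6, 2} fails with a dimension mismatch.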