2017-07-15 12 views
2

Je pratique Apache Pig. À l'aide de l'opérateur DEFINE et STREAM, je veux diffuser un fichier en utilisant un script python et obtenir une sortie éditée.Apache Pig: DEFINE STREAM Erreur lors de l'utilisation du code Python

Below is the file I am using. 

[[email protected] ~]$ cat data/movies_data.csv 
1,The Nightmare Before Christmas,1993,3.9,4568 
2,The Mummy,1932,3.5,4388 
3,Orphans of the Storm,1921,3.2,9062 
4,The Object of Beauty,1991,2.8,6150 
5,Night Tide,1963,2.8,5126 
6,One Magic Christmas,1985,3.8,5333 
7,Muriels Wedding,1994,3.5,6323 
8,Mothers Boys,1994,3.4,5733 
9,Nosferatu Original Version,1929,3.5,5651 
10,Nick of Time,1995,3.4,5333 

La sortie I attendu de porc en utilisant python est le premier multiple de la valeur du champ à 10, des deuxièmes données de terrain convertir en majuscules et troisième augmentation de l'année sur le terrain par 1.

sortie prévue de l'échantillon:

10,THE NIGHTMARE BEFORE CHRISTMAS,1994,3.9,4568 
20,THE MUMMY,1933,3.5,4388 

Mon code Python que j'ai utilisé dans

[[email protected] ~]$cat testpy22.py 

#!/usr/bin/python 

import sys 
import string 

for line in sys.stdin: 
    (f1,f2,f3,f4,f5)=str(line).strip().split(",") 
    f1 = f1*10 
    f2 = f2.upper() 
    f3 = f3+1 

    print"%d\t%s\t%d\t%.2f\t%d"%(f1,f2,f3,f4,f5) 

Et ci-dessous sont les Le code de porc que j'essaye:

grunt> a = load '/home/cloudera/data/movies_data.csv' using PigStorage(',') as (id:chararray,movie:chararray,year:chararray, point:chararray,code:chararray); 

grunt> dump a;  

Output(s): 
Successfully stored records in: "file:/tmp/temp-947273140/tmp1180787799" 

Job DAG: 
job_local1521960706_0008 


2017-07-15 04:56:26,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 
2017-07-15 04:56:26,250 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized 
2017-07-15 04:56:26,251 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1 
2017-07-15 04:56:26,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 
(1,The Nightmare Before Christmas,1993,3.9,4568) 
(2,The Mummy,1932,3.5,4388) 
(3,Orphans of the Storm,1921,3.2,9062) 
(4,The Object of Beauty,1991,2.8,6150) 
(5,Night Tide,1963,2.8,5126) 
(6,One Magic Christmas,1985,3.8,5333) 
(7,Muriels Wedding,1994,3.5,6323) 
(8,Mothers Boys,1994,3.4,5733) 
(9,Nosferatu Original Version,1929,3.5,5651) 
(10,Nick of Time,1995,3.4,5333) 


grunt> DEFINE testpy22 `testpy22.py` SHIP('/home/cloudera/testpy22.py'); 

grunt> aaa = STREAM a through testpy22;    

grunt> dump aaa; 

Quand je jette les données, j'obtiens l'erreur suivante. Je suppose que l'erreur est due au code Python. Mais je ne suis pas capable de trouver le problème.

2017-07-15 04:58:37,718 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/home/cloudera/data/movies_data.csv:0+344 
2017-07-15 04:58:37,736 [pool-9-thread-1] WARN org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address 
2017-07-15 04:58:37,755 [pool-9-thread-1] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 
2017-07-15 04:58:37,787 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: a[6,4],a[-1,-1],aaa[10,6] C: R: 
===== Task Information Header ===== 
Command: testpy22.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming) 
Start time: Sat Jul 15 04:58:37 PDT 2017 
Input-split file: file:/home/cloudera/data/movies_data.csv 
Input-split start-offset: 0 
Input-split length: 344 
=====   * * *   ===== 
2017-07-15 04:58:37,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1407418523_0009 
2017-07-15 04:58:37,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a,aaa 
2017-07-15 04:58:37,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[6,4],a[-1,-1],aaa[10,6] C: R: 
2017-07-15 04:58:37,857 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 
Traceback (most recent call last): 
    File "/home/cloudera/testpy22.py", line 7, in <module> 
    f1,f2,f3,f4,f5=str(line).strip().split(",") 
ValueError: need more than 1 value to unpack 
2017-07-15 04:58:37,913 [Thread-98] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'testpy22.py ' failed with exit status: 1 
2017-07-15 04:58:37,914 [Thread-94] INFO org.apache.hadoop.mapred.LocalJobRunner - Map task executor complete. 
2017-07-15 04:58:37,917 [Thread-99] ERROR org.apache.pig.impl.streaming.ExecutableManager - testpy22.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming) failed with exit status: 1 
===== Task Information Footer ===== 
End time: Sat Jul 15 04:58:37 PDT 2017 
Exit code: 1 
Input records: 10 
Input bytes: 3568 bytes (stdin using org.apache.pig.builtin.PigStreaming) 
Output records: 0 
Output bytes: 0 bytes (stdout using org.apache.pig.builtin.PigStreaming) 
=====   * * *   ===== 
2017-07-15 04:58:37,921 [Thread-94] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1407418523_0009 
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'testpy22.py ' failed with exit status: 1 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406) 
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'testpy22.py ' failed with exit status: 1 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.cleanup(PigGenericMapBase.java:124) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268) 
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) 
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:138) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) 
    at java.lang.Thread.run(Thread.java:662) 
2017-07-15 04:58:42,875 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure. 
2017-07-15 04:58:42,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local1407418523_0009 has failed! Stop running all dependent jobs 
2017-07-15 04:58:42,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 
2017-07-15 04:58:42,878 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed! 
2017-07-15 04:58:42,878 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete 
2017-07-15 04:58:42,879 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion PigVersion UserId StartedAt FinishedAt Features 
2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2017-07-15 04:58:37 2017-07-15 04:58:42 STREAMING 

Failed! 

Failed Jobs: 
JobId Alias Feature Message Outputs 
job_local1407418523_0009 a,aaa STREAMING,MAP_ONLY Message: Job failed! file:/tmp/temp-947273140/tmp1217312985, 

Input(s): 
Failed to read data from "/home/cloudera/data/movies_data.csv" 

Output(s): 
Failed to produce result in "file:/tmp/temp-947273140/tmp1217312985" 

Job DAG: 
job_local1407418523_0009 


2017-07-15 04:58:42,879 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 
2017-07-15 04:58:42,881 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias aaa 
Details at logfile: /home/cloudera/pig_1500117064292.log 
grunt> 2017-07-15 04:58:43,685 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - 

Est-ce que n'importe quel corps peut me donner n'importe quelle suggestion?

Python version: 2.6.6 Apache Version Pig: Apache Version Pig 0.11.0-cdh4.7.0

+1

[Python 2.6? wow] (http://imgur.com/hiX84mb) – idjaw

+0

@idjaw 2.6.6 qui vient par défaut avec CDH. – Green

+2

Si vous avez le contrôle, mettez à niveau cette version de Python. – idjaw

Répondre

0

La question que vous exécutez en est lorsque vous déballer la liste directement dans une série de valeurs par exemple le "tuple déballage" caractéristique du python:

f1, f2, f3, f4, f5 = str(line).strip().split(",") 

Python renvoie une erreur si la scission ("") renvoie quoi que ce soit d'autres que 5 des variables. Je suppose que vous frappe une ligne blanche ...

ValueError: need more than 1 value to unpack

mise à niveau Definitely Python 2.7, mais cela est une fonctionnalité de base et la version ne vous impactage.