Je pratique Apache Pig. À l'aide de l'opérateur DEFINE et STREAM, je veux diffuser un fichier en utilisant un script python et obtenir une sortie éditée.Apache Pig: DEFINE STREAM Erreur lors de l'utilisation du code Python
Below is the file I am using.
[[email protected] ~]$ cat data/movies_data.csv
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriels Wedding,1994,3.5,6323
8,Mothers Boys,1994,3.4,5733
9,Nosferatu Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
La sortie I attendu de porc en utilisant python est le premier multiple de la valeur du champ à 10, des deuxièmes données de terrain convertir en majuscules et troisième augmentation de l'année sur le terrain par 1.
sortie prévue de l'échantillon:
10,THE NIGHTMARE BEFORE CHRISTMAS,1994,3.9,4568
20,THE MUMMY,1933,3.5,4388
Mon code Python que j'ai utilisé dans
[[email protected] ~]$cat testpy22.py
#!/usr/bin/python
import sys
import string
for line in sys.stdin:
(f1,f2,f3,f4,f5)=str(line).strip().split(",")
f1 = f1*10
f2 = f2.upper()
f3 = f3+1
print"%d\t%s\t%d\t%.2f\t%d"%(f1,f2,f3,f4,f5)
Et ci-dessous sont les Le code de porc que j'essaye:
grunt> a = load '/home/cloudera/data/movies_data.csv' using PigStorage(',') as (id:chararray,movie:chararray,year:chararray, point:chararray,code:chararray);
grunt> dump a;
Output(s):
Successfully stored records in: "file:/tmp/temp-947273140/tmp1180787799"
Job DAG:
job_local1521960706_0008
2017-07-15 04:56:26,250 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-07-15 04:56:26,250 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-07-15 04:56:26,251 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-07-15 04:56:26,251 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriels Wedding,1994,3.5,6323)
(8,Mothers Boys,1994,3.4,5733)
(9,Nosferatu Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
grunt> DEFINE testpy22 `testpy22.py` SHIP('/home/cloudera/testpy22.py');
grunt> aaa = STREAM a through testpy22;
grunt> dump aaa;
Quand je jette les données, j'obtiens l'erreur suivante. Je suppose que l'erreur est due au code Python. Mais je ne suis pas capable de trouver le problème.
2017-07-15 04:58:37,718 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/home/cloudera/data/movies_data.csv:0+344
2017-07-15 04:58:37,736 [pool-9-thread-1] WARN org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address
2017-07-15 04:58:37,755 [pool-9-thread-1] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-07-15 04:58:37,787 [pool-9-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: a[6,4],a[-1,-1],aaa[10,6] C: R:
===== Task Information Header =====
Command: testpy22.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)
Start time: Sat Jul 15 04:58:37 PDT 2017
Input-split file: file:/home/cloudera/data/movies_data.csv
Input-split start-offset: 0
Input-split length: 344
===== * * * =====
2017-07-15 04:58:37,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1407418523_0009
2017-07-15 04:58:37,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a,aaa
2017-07-15 04:58:37,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[6,4],a[-1,-1],aaa[10,6] C: R:
2017-07-15 04:58:37,857 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
Traceback (most recent call last):
File "/home/cloudera/testpy22.py", line 7, in <module>
f1,f2,f3,f4,f5=str(line).strip().split(",")
ValueError: need more than 1 value to unpack
2017-07-15 04:58:37,913 [Thread-98] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'testpy22.py ' failed with exit status: 1
2017-07-15 04:58:37,914 [Thread-94] INFO org.apache.hadoop.mapred.LocalJobRunner - Map task executor complete.
2017-07-15 04:58:37,917 [Thread-99] ERROR org.apache.pig.impl.streaming.ExecutableManager - testpy22.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming) failed with exit status: 1
===== Task Information Footer =====
End time: Sat Jul 15 04:58:37 PDT 2017
Exit code: 1
Input records: 10
Input bytes: 3568 bytes (stdin using org.apache.pig.builtin.PigStreaming)
Output records: 0
Output bytes: 0 bytes (stdout using org.apache.pig.builtin.PigStreaming)
===== * * * =====
2017-07-15 04:58:37,921 [Thread-94] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1407418523_0009
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'testpy22.py ' failed with exit status: 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'testpy22.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.cleanup(PigGenericMapBase.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2017-07-15 04:58:42,875 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-07-15 04:58:42,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local1407418523_0009 has failed! Stop running all dependent jobs
2017-07-15 04:58:42,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-07-15 04:58:42,878 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2017-07-15 04:58:42,878 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2017-07-15 04:58:42,879 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2017-07-15 04:58:37 2017-07-15 04:58:42 STREAMING
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_local1407418523_0009 a,aaa STREAMING,MAP_ONLY Message: Job failed! file:/tmp/temp-947273140/tmp1217312985,
Input(s):
Failed to read data from "/home/cloudera/data/movies_data.csv"
Output(s):
Failed to produce result in "file:/tmp/temp-947273140/tmp1217312985"
Job DAG:
job_local1407418523_0009
2017-07-15 04:58:42,879 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-07-15 04:58:42,881 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias aaa
Details at logfile: /home/cloudera/pig_1500117064292.log
grunt> 2017-07-15 04:58:43,685 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
Est-ce que n'importe quel corps peut me donner n'importe quelle suggestion?
Python version: 2.6.6 Apache Version Pig: Apache Version Pig 0.11.0-cdh4.7.0
[Python 2.6? wow] (http://imgur.com/hiX84mb) – idjaw
@idjaw 2.6.6 qui vient par défaut avec CDH. – Green
Si vous avez le contrôle, mettez à niveau cette version de Python. – idjaw