Push Limit into Loader
Pig optimizes limit queries by automatically pushing the limit to the loader, so that only a fraction of the input has to be scanned.
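A minimal sketch of a query where this optimization can apply, with the LIMIT directly following the LOAD. The file name 'input.txt' and the limit of 10 are illustrative assumptions, not taken from the original text:

```pig
-- 'input.txt' is a hypothetical input file
A = LOAD 'input.txt';
-- Only the first 10 records are needed, so the loader can stop reading early
B = LIMIT A 10;
DUMP B;
```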
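-- A nested LIMIT inside FOREACH: keep one row per (col1, col2, col3, col4) group
A = LOAD '1.txt' AS (col1: chararray, col2: chararray, col3: chararray, col4: chararray, col5: chararray);
B = GROUP A BY (col1, col2, col3, col4);
C = FOREACH B {
    -- A here is the bag of rows that belong to the current group
    D = LIMIT A 1;
    GENERATE FLATTEN(D);
};
DUMP C;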
Setting the Pig job name to "This is my job" makes the job easy to find in the Hadoop JobTracker web UI. If no name is set, the job is shown as "PigLatin:DefaultJobName".
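The statement that sets the name is not shown in the text above. A likely form, assuming Pig's standard job.name property, is:

```pig
-- Assumed reconstruction: name the job so it is easy to spot in the JobTracker UI
SET job.name 'This is my job';
```

The statement usually goes at the top of the script so that it applies to every job the script launches.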
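(6) One cause of the "scalar has more than one row in the output" error
Have you run into this error? Here is how to reproduce it.
Suppose there are two input files, a.txt and b.txt, which are loaded and joined as follows: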
A = LOAD 'a.txt' AS (col1: int, col2: int);
B = LOAD 'b.txt' AS (col1: int, col2: int);
C = JOIN A BY col1, B BY col1;
D = FOREACH C GENERATE A.col1;
DUMP D;
This code is guaranteed to fail, with the following error message:
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,2), 2nd :(3,4)
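Source: http://www.codelast.com/
At first glance the code looks too simple to contain a bug. On closer inspection, though, "A.col1" is simply the wrong syntax and should be written "A::col1". Running DESCRIBE on C makes this clear: after the JOIN, the fields of C carry relation prefixes such as A::col1 and B::col1. Pig instead reads "A.col1" as a scalar projection of relation A, which is only valid when A has exactly one row.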
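A corrected version of the script might look like the following sketch. It reuses the same file names and schemas as the failing example and only changes the field reference:

```pig
A = LOAD 'a.txt' AS (col1: int, col2: int);
B = LOAD 'b.txt' AS (col1: int, col2: int);
C = JOIN A BY col1, B BY col1;
-- Inspect the joined schema: fields are listed with A:: and B:: prefixes
DESCRIBE C;
-- Use the disambiguation operator (::) instead of the scalar projection (.)
D = FOREACH C GENERATE A::col1;
DUMP D;
```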
Storing to a directory whose name ends in ".bz2" or ".gz" or ".lzo" (if you have installed support for LZO compression in Hadoop) will automatically use the corresponding compression codec.
The output.compression.enabled and output.compression.codec job properties can also be used to turn on output compression.
Loading from directories ending in .bz2 or .bz works automatically; other compression formats are not auto-detected on loading.
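The two ways of enabling output compression described above can be sketched as follows. The input file and output directory names are hypothetical, and the codec shown is Hadoop's standard gzip codec class. The two alternatives are independent, so a real script would normally use only one of them:

```pig
-- 'input.txt', 'output.gz' and 'output_dir' are hypothetical names
A = LOAD 'input.txt';

-- Alternative 1: a directory name ending in .gz selects gzip compression
STORE A INTO 'output.gz';

-- Alternative 2: enable compression through job properties
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
STORE A INTO 'output_dir';
```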