This installation puts Pig on a single pseudo-distributed Hadoop node. Pig is a project that Yahoo donated to Apache: a SQL-like, high-level query language built on top of MapReduce. It compiles operations into the Map and Reduce phases of the MapReduce model, and users can define their own functions. Pig is a client-side application, so even if you run Pig against a Hadoop cluster, nothing extra needs to be installed on the cluster itself.

First, download the Pig tarball from the official site and upload it to the server, then unpack it:

[hadoop@hadoop1 soft]$ tar -zxvf pig-0.13.0.tar.gz

For easier configuration, rename the unpacked directory:

[hadoop@hadoop1 ~]$ mv pig-0.13.0 pig2

Add the Pig environment variables to the hadoop user's .bash_profile:

[hadoop@hadoop1 ~]$ cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin
export PATH
export JAVA_HOME=/usr/lib/jvm/java-1.7.0/
export HADOOP_HOME=/home/hadoop/hadoop2
export PIG_HOME=/home/hadoop/pig2
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop/
export PATH=$PATH:$JAVA_HOME/bin/:$HADOOP_HOME/bin:$PIG_HOME/bin

[hadoop@hadoop1 ~]$ source .bash_profile
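As a quick sanity check, the PATH-appending pattern from .bash_profile can be exercised on its own. This is a minimal sketch; /home/hadoop/pig2 is the path used in this post and is assumed to match your layout:

```shell
# Sketch: append Pig's bin directory to PATH the same way .bash_profile does.
# PIG_HOME here matches the path used in this post; adjust for your machine.
PIG_HOME=/home/hadoop/pig2
PATH="$PATH:$PIG_HOME/bin"

# Confirm the directory is now a component of PATH.
case ":$PATH:" in
  *":$PIG_HOME/bin:"*) echo "pig bin is on PATH" ;;
  *)                   echo "pig bin is missing from PATH" ;;
esac
```

If the check fails after `source .bash_profile`, the most common cause is a typo in the final `export PATH=...` line.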
Pig has two execution modes. The first is local mode: Pig runs inside a single JVM and accesses the local file system, which only suits small data sets and is mainly useful for trying Pig out. Note that local mode does not use Hadoop's LocalJobRunner; Pig translates the query into a physical plan and executes it itself. Enter local mode by typing:

% pig -x local

The second is Hadoop (MapReduce) mode. In this mode Pig really does translate queries into MapReduce jobs and submits them to a Hadoop cluster, which can be either fully distributed or pseudo-distributed.
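To make the mode selection explicit, the two launch commands can be captured in a tiny wrapper. `pig_cmd` is a hypothetical helper, not part of Pig; it only prints the invocation it would run, so the sketch works even where Pig is not installed:

```shell
# Hypothetical helper: print the pig invocation for a given execution mode.
pig_cmd() {
  case "$1" in
    local)     echo "pig -x local" ;;      # single JVM, local file system
    mapreduce) echo "pig -x mapreduce" ;;  # submit jobs to the Hadoop cluster
    *)         echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

pig_cmd local       # → pig -x local
pig_cmd mapreduce   # → pig -x mapreduce
```

Running plain `pig` with no `-x` flag defaults to MapReduce mode, as the startup log below shows.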
[hadoop@hadoop1 ~]$ pig
14/09/10 21:04:08 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/09/10 21:04:08 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/09/10 21:04:08 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-09-10 21:04:09,149 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-09-10 21:04:09,150 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig2/pig-err.log
2014-09-10 21:04:09,435 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2014-09-10 21:04:10,345 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-09-10 21:04:10,345 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-09-10 21:04:10,346 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://hadoop1:9000
2014-09-10 21:04:10,360 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2014-09-10 21:04:12,820 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-09-10 21:04:12,821 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: hadoop1:9001
2014-09-10 21:04:12,831 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
grunt> help
Commands:
<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands:
fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic commands:
describe <alias>[::<alias>] - Show the schema for the alias. Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.
-script - Explain the entire script.
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.
-xml - Generate the output in .xml format. Default is text format.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.
Utility Commands:
exec [-param <param_name>=<param_value>] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment including aliases.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=<param_value>] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
sh <shell command> - Invoke a shell command.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
The following keys are supported:
default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
debug - Set debug on or off. Default is off.
job.name - Single-quoted name for jobs. Default is PigLatin:<scriptname>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
history [-n] - Display the list statements in cache.
-n Hide line numbers.
quit - Quit the grunt shell.
grunt>
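With the Grunt shell working, a first script can be tried end to end. The sketch below writes a minimal Pig Latin script to a file and runs it in local mode; the file name first.pig and the /etc/passwd example are illustrative, while LOAD, FOREACH, and DUMP are standard Pig Latin operators:

```shell
# Write a minimal Pig Latin script: list the user names from /etc/passwd.
cat > first.pig <<'EOF'
A = LOAD '/etc/passwd' USING PigStorage(':');
B = FOREACH A GENERATE $0 AS user;
DUMP B;
EOF

# Run it in local mode if pig is on PATH (guarded so the sketch is harmless
# on machines without Pig installed).
if command -v pig >/dev/null 2>&1; then
  pig -x local first.pig
else
  echo "pig not found on PATH; skipping run"
fi
```

In local mode this reads the local file system directly; in MapReduce mode the same script would expect '/etc/passwd' to be a path on HDFS.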