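1. Introduction to Hadoop
Hadoop is an open-source Apache project whose main goal is to build reliable, scalable, distributed systems. Hadoop is a collection of subprojects, including:
1. Hadoop Common: provides the common infrastructure used by the other subprojects
2. HDFS: a distributed file system
3. MapReduce: a software framework for distributed processing of large data sets on compute clusters; in other words, a framework that simplifies distributed programming
4. Other projects, including Avro (a serialization system), Cassandra (a database project), and others
2. Setting up the Hadoop environment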
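This section covers setting up a Hadoop environment so that MapReduce and HDFS can be tested in the sections below. Note that everything here is tested on a single Ubuntu host only.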
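2.1 Required software on Ubuntu
Hadoop requires a JDK to be installed first, along with two further dependencies, ssh and rsync. The Java installation is skipped here for now; on Ubuntu, ssh and rsync can be installed with the following command: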
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ sudo apt-get install ssh rsync
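As a quick sanity check after installing, a small sketch like the following can confirm that the three prerequisites are on the PATH. The function name missing_tools is made up for illustration, and finding a java binary is only a rough proxy for a working JDK.

```shell
#!/bin/sh
# Print each command name from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of Hadoop or Ubuntu.)
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to ask whether a command resolves.
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Prerequisites named in this section: a JDK (java), ssh, and rsync.
missing=$(missing_tools java ssh rsync)
if [ -z "$missing" ]; then
    echo "all prerequisites found"
else
    # Left unquoted on purpose so the names are joined on one line.
    echo "missing:" $missing
fi
```

If anything is reported as missing, rerun the apt-get command above (or install a JDK) before continuing.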
Once the installation finishes, try ssh localhost:
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ ssh localhost
Linux ubuntu 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Fri Apr 22 00:28:54 2011 from localhost
xuqiang@ubuntu:~$
This shows that ssh was installed successfully. Use the exit command to leave the ssh session.
2.2 Download a Hadoop release
http://apache.etoak.com//hadoop/core/
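Once a release archive has been downloaded, unpacking it might look like the sketch below. The archive name hadoop-0.21.0.tar.gz and the helper unpack_release are assumptions for illustration; the target directory is inferred from the shell prompt used in this article. Adjust both to match what you actually downloaded.

```shell
#!/bin/sh
# Unpack a downloaded release archive into a destination directory.
# Usage: unpack_release <archive.tar.gz> <destination-dir>
# (unpack_release is a hypothetical helper name.)
unpack_release() {
    archive=$1
    dest=$2
    # Create the destination if needed, then extract the gzip-compressed tarball into it.
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Assumed file name and target directory; change these to fit your download.
ARCHIVE="hadoop-0.21.0.tar.gz"
DEST="$HOME/hadoop/src"

if [ -f "$ARCHIVE" ]; then
    unpack_release "$ARCHIVE" "$DEST"
    ls "$DEST"
else
    echo "archive not found: $ARCHIVE"
fi
```

Wrapping the extraction in a function keeps the paths in one place, so the same call works for a different version or directory.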
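2.3 Edit the Hadoop configuration files
Unpack the archive downloaded in the previous step. The resulting directory structure is as follows: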
xuqiang@ubuntu:~/hadoop/src/hadoop-0.21.0$ ls
bin                                   hadoop-mapred-test-0.21.0.jar
c++                                   hadoop-mapred-tools-0.21.0.jar
common                                hdfs
conf                                  input
hadoop-common-0.21.0.jar              lib
hadoop-common-test-0.21.0.jar         LICENSE.txt
hadoop-hdfs-0.21.0.jar                logs
hadoop-hdfs-0.21.0-sources.jar        mapred
hadoop-hdfs-ant-0.21.0.jar            NOTICE.txt
hadoop-hdfs-test-0.21.0.jar           output
hadoop-hdfs-test-0.21.0-sources.jar   README.txt
hadoop-mapred-0.21.0.jar              webapps
hadoop-mapred-0.21.0-sources.jar
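hadoop-mapred-examples-0.21.0.jar
The bin directory mainly contains the scripts that start Hadoop, the conf directory holds Hadoop's configuration files, and the c++ directory holds the header files needed for C++ development against Hadoop.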
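To confirm that the unpacked tree has the pieces described above, a small check along these lines may help. The helper name check_layout is hypothetical, and the install path is an assumption taken from the shell prompt in the listing.

```shell
#!/bin/sh
# Report which of the expected entries are missing under a Hadoop directory.
# Usage: check_layout <hadoop-dir>; prints one missing path per line.
# (check_layout is a hypothetical helper name.)
check_layout() {
    root=$1
    for entry in bin conf conf/hadoop-env.sh; do
        if [ ! -e "$root/$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}

# Assumed install location, taken from the shell prompt shown in the listing.
HADOOP_DIR="$HOME/hadoop/src/hadoop-0.21.0"

missing=$(check_layout "$HADOOP_DIR")
if [ -z "$missing" ]; then
    echo "layout looks complete: $HADOOP_DIR"
else
    # Left unquoted on purpose so the paths are joined on one line.
    echo "missing under $HADOOP_DIR:" $missing
fi
```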
Edit conf/hadoop-env.sh and set the JAVA_HOME option in it:
# The java implementation to use. Required.