Hadoop Environment Setup
References:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Download the 2.4.1 binary package, unpack it, and fill in the configuration files as described in the links above. On startup you will likely hit the error "Unable to load realm info from SCDynamicStore"; fix it by adding the settings below to hadoop-env.sh (the same problem shows up when configuring HBase, and the same fix in hbase-env.sh solves it there).
Add to hadoop-env.sh (or hbase-env.sh):
export JAVA_HOME="/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home"
export HBASE_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
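The JAVA_HOME above is hard-coded to the Apple 1.6 JDK path; on a Mac it can also be resolved dynamically (a small sketch, assuming the stock /usr/libexec/java_home helper):
# Resolve JAVA_HOME instead of hard-coding the 1.6.0 path
export JAVA_HOME="$(/usr/libexec/java_home -v 1.6)"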
Finally, write your own start and stop scripts.
hadoop-start.sh
#!/bin/bash
HADOOP_PREFIX="/Users/zhenweiliu/Work/Software/hadoop-2.4.1"
HADOOP_YARN_HOME="/Users/zhenweiliu/Work/Software/hadoop-2.4.1"
HADOOP_CONF_DIR="/Users/zhenweiliu/Work/Software/hadoop-2.4.1/etc/hadoop"
cluster_name="hadoop_cat"
# Format a new distributed filesystem
if [ "$1" == "format" ]; then
$HADOOP_PREFIX/bin/hdfs namenode -format $cluster_name
fi
# Start the HDFS with the following command, run on the designated NameNode:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
# Run a script to start DataNodes on all slaves:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
# Start the YARN with the following command, run on the designated ResourceManager:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
# Run a script to start NodeManagers on all slaves:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
# Start a standalone WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR
# Start the MapReduce JobHistory Server with the following command, run on the designated server:
$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
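How I drive the script above (the format argument is only needed the first time; jps is just a quick way to confirm the daemons are up):
# first run: format HDFS, then start everything
./hadoop-start.sh format
# later runs: just start the daemons
./hadoop-start.sh
# confirm NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer are running
jps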
hadoop-stop.sh
#!/bin/bash
HADOOP_PREFIX="/Users/zhenweiliu/Work/Software/hadoop-2.4.1"
HADOOP_YARN_HOME="/Users/zhenweiliu/Work/Software/hadoop-2.4.1"
HADOOP_CONF_DIR="/Users/zhenweiliu/Work/Software/hadoop-2.4.1/etc/hadoop"
cluster_name="hadoop_cat"
# Stop the NameNode with the following command, run on the designated NameNode:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
# Run a script to stop DataNodes on all slaves:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
# Stop the ResourceManager with the following command, run on the designated ResourceManager:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
# Run a script to stop NodeManagers on all slaves:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
# Stop the WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh stop proxyserver --config $HADOOP_CONF_DIR
# Stop the MapReduce JobHistory Server with the following command, run on the designated server:
$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR
hadoop-restart.sh
#!/bin/bash
./hadoop-stop.sh
./hadoop-start.sh
Finally, here are the Hadoop configuration values I needed to set.
core-site.xml
<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
<property><name>io.file.buffer.size</name><value>131072</value></property>
hdfs-site.xml
<property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
<property><name>dfs.datanode.data.dir</name><value>file:///Users/zhenweiliu/Work/Software/hadoop-2.4.1/data</value></property>
<property><name>dfs.blocksize</name><value>67108864</value></property>
<property><name>dfs.namenode.handler.count</name><value>100</value></property>
<property><name>dfs.namenode.name.dir</name><value>file:///Users/zhenweiliu/Work/Software/hadoop-2.4.1/name</value></property>
yarn-site.xml
<property><name>yarn.acl.enable</name><value>false</value></property>
<property><name>yarn.resourcemanager.address</name><value>localhost:9001</value></property>
<property><name>yarn.resourcemanager.scheduler.address</name><value>localhost:9002</value></property>
<property><name>yarn.resourcemanager.resource-tracker.address</name><value>localhost:9003</value></property>
<property><name>yarn.resourcemanager.admin.address</name><value>localhost:9004</value></property>
<property><name>yarn.resourcemanager.webapp.address</name><value>localhost:9005</value></property>
<property><name>yarn.resourcemanager.scheduler.class</name><value>CapacityScheduler</value></property>
<property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>
<property><name>yarn.nodemanager.resource.memory-mb</name><value>8192</value></property>
<property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>2.1</value></property>
<property><name>yarn.nodemanager.local-dirs</name><value>${hadoop.tmp.dir}/nm-local-dir</value></property>
<property><name>yarn.nodemanager.log-dirs</name><value>${yarn.log.dir}/userlogs</value></property>
<property><name>yarn.nodemanager.log.retain-seconds</name><value>10800</value></property>
<property><name>yarn.nodemanager.remote-app-log-dir</name><value>/logs</value></property>
<property><name>yarn.nodemanager.remote-app-log-dir-suffix</name><value>logs</value></property>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>-1</value></property>
mapred-site.xml
<property><name>mapreduce.framework.name</name><value>yarn</value></property>
<property><name>mapreduce.map.memory.mb</name><value>1536</value></property>
<property><name>mapreduce.map.java.opts</name><value>-Xmx1024M</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>3072</value></property>
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx2560M</value></property>
<property><name>mapreduce.task.io.sort.mb</name><value>512</value></property>
<property><name>mapreduce.task.io.sort.factor</name><value>100</value></property>
<property><name>mapreduce.reduce.shuffle.parallelcopies</name><value>50</value></property>
<property><name>mapreduce.jobhistory.address</name><value>localhost:10020</value></property>
<property><name>mapreduce.jobhistory.webapp.address</name><value>localhost:19888</value></property>
<property><name>mapreduce.jobhistory.intermediate-done-dir</name><value>file:///Users/zhenweiliu/Work/Software/hadoop-2.4.1/mr-history/tmp</value></property>
<property><name>mapreduce.jobhistory.done-dir</name><value>file:///Users/zhenweiliu/Work/Software/hadoop-2.4.1/mr-history/done</value></property>
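A quick way to check that the daemons picked these values up is to query them back (a small sketch using the stock hdfs getconf tool):
bin/hdfs getconf -confKey fs.defaultFS    # expect hdfs://localhost:9000
bin/hdfs getconf -confKey dfs.blocksize   # expect 67108864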
ZooKeeper Pseudo-Distributed Setup
Make 3 copies of the ZooKeeper distribution folder, named:
zookeeper-3.4.5-1
zookeeper-3.4.5-2
zookeeper-3.4.5-3
The zoo.cfg under each instance folder is configured as follows:
zookeeper-3.4.5-1/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-1/data
dataLogDir=/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-1/logs
clientPort=2181
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
zookeeper-3.4.5-2/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-2/data
dataLogDir=/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-2/logs
clientPort=2182
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
zookeeper-3.4.5-3/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-3/data
dataLogDir=/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-3/logs
clientPort=2183
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
Then create a file named myid under each instance's data folder, containing the single character 1, 2 or 3 respectively, e.g.
zookeeper-3.4.5-1/data/myid
1
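The same can be done for all three instances in one go (a small sketch; it assumes the instance folders above and creates the data directories if needed):
for no in 1 2 3; do
  mkdir -p /Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-$no/data
  echo $no > /Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-$no/data/myid
done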
Finally, write batch start and stop scripts.
startZkCluster.sh
#!/bin/bash
BASE_DIR="/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5"
BIN_EXEC="bin/zkServer.sh start"
for no in $(seq 1 3)
do
"${BASE_DIR}-${no}"/$BIN_EXEC
done
stopZkCluster.sh
#!/bin/bash
BASE_DIR="/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5"
BIN_EXEC="bin/zkServer.sh stop"
for no in $(seq 1 3)
do
"${BASE_DIR}-${no}"/$BIN_EXEC
done
restartZkCluster.sh
#!/bin/bash
./stopZkCluster.sh
./startZkCluster.sh
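To verify the ensemble actually formed a quorum, each instance can be asked for its status (a sketch; zkServer.sh reads the clientPort from each instance's own zoo.cfg):
#!/bin/bash
# expect one instance to report "Mode: leader" and the other two "Mode: follower"
for no in $(seq 1 3)
do
/Users/zhenweiliu/Work/Software/zookeeper/zookeeper-3.4.5-$no/bin/zkServer.sh status
done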
HBase
References:
http://abloz.com/hbase/book.html
HBase actually ships with a bundled ZooKeeper. If you don't explicitly point it at an external ZK, it uses the bundled one, which starts and stops along with HBase.
Explicitly let HBase manage the bundled ZK in hbase-env.sh:
export HBASE_MANAGES_ZK=true
hbase-site.xml
<property><name>hbase.rootdir</name><value>hdfs://localhost:9000/hbase</value><description>The directory shared by RegionServers.</description></property>
<property><name>dfs.replication</name><value>1</value><description>The replication count for HLog and HFile storage. Should not be greater than the HDFS datanode count.</description></property>
<property><name>hbase.zookeeper.quorum</name><value>localhost</value></property>
<property><name>hbase.zookeeper.property.dataDir</name><value>/Users/zhenweiliu/Work/Software/hbase-0.98.3-hadoop2/zookeeper</value></property>
<property><name>hbase.zookeeper.property.clientPort</name><value>2222</value><description>Property from ZooKeeper's config zoo.cfg. The port at which the clients will connect.</description></property>
<property><name>hbase.cluster.distributed</name><value>true</value></property>
Finally, start HBase:
./start-hbase.sh
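A quick smoke test once HBase is up (a sketch using the standard hbase shell; the 'test' table and 'cf' column family are just examples):
./hbase shell
# inside the shell:
status                                 # should report 1 live server
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'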
System Parameters
In addition, HBase needs a large number of processes and open file descriptors, so ulimit has to be raised. On my Mac I added an /etc/launchd.conf file with the following contents:
limit maxfiles 16384 16384
limit maxproc 2048 2048
And added to /etc/profile:
ulimit -n 16384
ulimit -u 2048
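After a reboot (so /etc/launchd.conf is re-read), the new limits can be verified (a small check, assuming the values above):
launchctl limit maxfiles    # expect: maxfiles 16384 16384
launchctl limit maxproc     # expect: maxproc 2048 2048
ulimit -n                   # expect: 16384
ulimit -u                   # expect: 2048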
If HBase throws:
2014-07-14 23:00:48,342 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ERROR: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:90)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:73)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Check the HBase master log, which shows:
2014-07-14 23:31:51,270 INFO [master:192.168.126.8:60000] util.FSUtils: Waiting for dfs to exit safe mode...
Force HDFS out of safe mode:
bin/hdfs dfsadmin -safemode leave
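To confirm whether the NameNode really is in safe mode (before or after forcing it out), the standard query is:
bin/hdfs dfsadmin -safemode get    # prints "Safe mode is ON" or "Safe mode is OFF"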
The master log then reports:
2014-07-14 23:32:22,238 WARN [master:192.168.126.8:60000] hdfs.DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1761102757-192.168.126.8-1404787541755:blk_1073741825_1001 file=/hbase/hbase.version
Check HDFS:
./hdfs fsck / -files -blocks
14/07/14 23:36:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://localhost:50070
FSCK started by zhenweiliu (auth:SIMPLE) from /127.0.0.1 for path / at Mon Jul 14 23:36:33 CST 2014
.
/hbase/WALs/192.168.126.8,60020,1404917152583-splitting/192.168.126.8%2C60020%2C1404917152583.1404917158940: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741842
/hbase/WALs/192.168.126.8,60020,1404917152583-splitting/192.168.126.8%2C60020%2C1404917152583.1404917158940: MISSING 1 blocks of total size 17 B..
/hbase/WALs/192.168.126.8,60020,1404917152583-splitting/192.168.126.8%2C60020%2C1404917152583.1404917167188.meta: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741843
/hbase/WALs/192.168.126.8,60020,1404917152583-splitting/192.168.126.8%2C60020%2C1404917152583.1404917167188.meta: MISSING 1 blocks of total size 401 B..
/hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741829
/hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 372 B..
/hbase/data/hbase/meta/1588230740/.regioninfo: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741827
/hbase/data/hbase/meta/1588230740/.regioninfo: MISSING 1 blocks of total size 30 B..
/hbase/data/hbase/meta/1588230740/info/e63bf8b1e649450895c36f28fb88da98: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741836
/hbase/data/hbase/meta/1588230740/info/e63bf8b1e649450895c36f28fb88da98: MISSING 1 blocks of total size 1340 B..
/hbase/data/hbase/meta/1588230740/oldWALs/hlog.1404787632739: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741828
/hbase/data/hbase/meta/1588230740/oldWALs/hlog.1404787632739: MISSING 1 blocks of total size 17 B..
/hbase/data/hbase/namespace/.tabledesc/.tableinfo.0000000001: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741832
/hbase/data/hbase/namespace/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 286 B..
/hbase/data/hbase/namespace/a3fbb84530e05cab6319257d03975e6b/.regioninfo: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741833
/hbase/data/hbase/namespace/a3fbb84530e05cab6319257d03975e6b/.regioninfo: MISSING 1 blocks of total size 40 B..
/hbase/data/hbase/namespace/a3fbb84530e05cab6319257d03975e6b/info/770eb1a6dc76458fb97e9213edb80b72: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741837
/hbase/data/hbase/namespace/a3fbb84530e05cab6319257d03975e6b/info/770eb1a6dc76458fb97e9213edb80b72: MISSING 1 blocks of total size 1045 B..
/hbase/hbase.id: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741826
/hbase/hbase.id: MISSING 1 blocks of total size 42 B..
/hbase/hbase.version: CORRUPT blockpool BP-1761102757-192.168.126.8-1404787541755 block blk_1073741825
/hbase/hbase.version: MISSING 1 blocks of total size 7 B.
Status: CORRUPT
Total size: 3597 B
Total dirs: 21
Total files: 11
Total symlinks: 0
Total blocks (validated): 11 (avg. block size 327 B)
********************************
CORRUPT FILES: 11
MISSING BLOCKS: 11
MISSING SIZE: 3597 B
CORRUPT BLOCKS: 11
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 11
Missing replicas: 0
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jul 14 23:36:33 CST 2014 in 15 milliseconds
The filesystem under path '/' is CORRUPT
Delete the corrupt files:
./hdfs fsck / -delete
14/07/14 23:41:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://localhost:50070
FSCK started by zhenweiliu (auth:SIMPLE) from /127.0.0.1 for path / at Mon Jul 14 23:41:46 CST 2014
Status: HEALTHY
Total size: 0 B
Total dirs: 21
Total files: 0
Total symlinks: 0
Total blocks (validated): 0
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jul 14 23:41:46 CST 2014 in 4 milliseconds
The filesystem under path '/' is HEALTHY
At this point HBase turns out to have died. The master log shows:
2014-07-14 23:48:53,788 FATAL [master:192.168.126.8:60000] master.HMaster: Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.util.FileSystemVersionException: HBase file layout needs to be upgraded. You have version null and I want version 8. Is your hbase.rootdir valid? If so, you may need to run 'hbase hbck -fixVersionFile'.
at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:602)
at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:456)
at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:147)
at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:128)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:802)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:615)
at java.lang.Thread.run(Thread.java:695)
Remove the /hbase directory in HDFS so HBase can recreate its file layout:
bin/hadoop fs -rm -r /hbase
The HBase master then reports:
2014-07-14 23:56:33,999 INFO [master:192.168.126.8:60000] catalog.CatalogTracker: Failed verification of hbase:meta,,1 at address=192.168.126.8,60020,1405352769509, exception=org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on 192.168.126.8,60020,1405353371628
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2683)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4117)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionInfo(HRegionServer.java:3494)
at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20036)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:168)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:39)
at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:111)
at java.lang.Thread.run(Thread.java:695)
Remove the stale meta-region-server znode so it can be rebuilt:
bin/hbase zkcli
rmr /hbase/meta-region-server
Restart HBase once more, and the problem is resolved.
Important HBase Parameters
These parameters are configured in hbase-site.xml.
1. zookeeper.session.timeout
The default is 3 minutes. That means that once a server dies, it takes the Master at least 3 minutes to detect the failure and start recovery. You may want to shorten this timeout so the Master notices sooner. Before you lower it, make sure your JVM GC settings are in order, otherwise a single long GC pause can be enough to trigger the timeout. (Although, when a RegionServer is stuck in a long GC, you probably do want it restarted and recovered.)
To change this setting, edit hbase-site.xml, deploy the config to the whole cluster, and restart.
The reason this value is set so high by default is to avoid answering the same newcomer question on the forums all day long: "Why did my RegionServer die during a massive data import?" The usual cause is a long GC pause on an untuned JVM. The reasoning goes: someone new to HBase can't be expected to know all of this, and there's no need to dent their confidence; once they are more familiar with HBase, they can tune this value down themselves.
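For example, lowering the timeout to one minute is a single property in hbase-site.xml (the 60000 ms value below is just an illustration, not a recommendation):
<property><name>zookeeper.session.timeout</name><value>60000</value></property>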
2. hbase.regionserver.handler.count
This setting determines the number of threads that handle user requests. The default is 10, which is deliberately small: it keeps RegionServers from collapsing when many concurrent clients each use a large write buffer. The rule of thumb is to keep this value low when request payloads are large (in the MB range: big puts, scans using caching) and to raise it when payloads are small (gets, small puts, ICVs, deletes).
When client payloads are small, it is safe to set this value as high as the maximum number of concurrent clients. A typical example is a cluster serving a website: puts are usually not buffered and the vast majority of operations are gets.
The danger of raising it is that buffering all those Put payloads puts heavy pressure on memory and can even lead to OutOfMemory errors. A RegionServer running short of memory triggers GC more and more often until the pauses become noticeable (the memory held by in-flight request payloads cannot be reclaimed no matter how many times GC runs). After a while the whole cluster feels it, because every request aimed at that region slows down, which drags the cluster down further and makes the problem worse.
To get a feel for whether you have too many or too few handlers, enable RPC-level logging on a single RegionServer (see Section 12.2.2.1, "Enabling RPC-level logging") and watch the tail of its log (the request queues consume memory).
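Raising the handler count is likewise a one-line change in hbase-site.xml (the value 30 below is purely illustrative):
<property><name>hbase.regionserver.handler.count</name><value>30</value></property>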