[Experience Sharing] [General] Apache Hadoop 2.2.0 Cluster Setup (1) [Translation]

Posted on 2016-12-11 07:20:45
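  Purpose
  This document describes how to install, configure, and manage Hadoop clusters ranging from a few nodes to thousands of nodes.
  If you are new to Hadoop, start with a single-node cluster first.
  Prerequisites
  Download a stable release of Hadoop from Apache.
  Installation
  Installing a Hadoop cluster typically involves unpacking the software on every node in the cluster, or installing it from RPM packages.
  Typically one node in the cluster is designated as the NameNode and another node as the ResourceManager; these are the masters. The remaining nodes act as both DataNode and NodeManager; these are the slaves.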
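  Running Hadoop in Non-Secure Mode
  The following sections describe how to configure a Hadoop cluster.
  Configuration Files
  Hadoop uses two types of configuration files:
  Read-only default configuration: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
  Site-specific configuration: conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml and conf/mapred-site.xml.
  In addition, you can control the Hadoop scripts found in the bin/ directory by setting site-specific environment variables in conf/hadoop-env.sh and conf/yarn-env.sh.
  Site Configuration
  To configure a Hadoop cluster, you first need to configure the environment in which the Hadoop daemons execute.
  The Hadoop daemons are NameNode/DataNode and ResourceManager/NodeManager.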
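  Configuring the Environment of the Hadoop Daemons
  Administrators should use the conf/hadoop-env.sh and conf/yarn-env.sh scripts to customize the environment of the Hadoop daemons.
  At the very least, verify that JAVA_HOME is set correctly on every node.
  In most cases you should also point HADOOP_PID_DIR and HADOOP_SECURE_DN_PID_DIR at directories that can be written only by the users who start the Hadoop daemons. Otherwise there is the potential for a symlink attack.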
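  As a rough sketch of the two points above, the export lines below could go into conf/hadoop-env.sh, with the mkdir and chmod run once by the user that starts the daemons. The JDK path and the PID directory are placeholders chosen for illustration (they are not from the original guide); substitute the values for your site.

```shell
# Hypothetical locations: replace both with the real paths for your site.
export JAVA_HOME="/usr/lib/jvm/java"                            # assumed JDK path
export HADOOP_PID_DIR="${TMPDIR:-/tmp}/hadoop-pids-$(id -un)"   # assumed PID directory
export HADOOP_SECURE_DN_PID_DIR="$HADOOP_PID_DIR"

# Run as the user that starts the daemons: create the directory and
# drop all access for group and others, so nobody else can plant a symlink.
mkdir -p "$HADOOP_PID_DIR"
chmod 700 "$HADOOP_PID_DIR"
```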

  Administrators can configure individual daemons using the following options:

Daemon | Environment Variable
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager | YARN_RESOURCEMANAGER_OPTS
NodeManager | YARN_NODEMANAGER_OPTS
WebAppProxy | YARN_PROXYSERVER_OPTS
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS

  For example, to configure the NameNode to use parallel GC, add the following line to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
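  Other useful parameters that you can customize include:
  HADOOP_LOG_DIR / YARN_LOG_DIR: the directory where the daemons' log files are stored. It is created automatically if it does not exist.
  HADOOP_HEAPSIZE / YARN_HEAPSIZE: the maximum heap size to use, in MB. For example, if the variable is set to 1000, the heap is set to 1000 MB. The default is 1000. To set the heap size separately for each daemon, use the following variables: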


Daemon | Environment Variable
ResourceManager | YARN_RESOURCEMANAGER_HEAPSIZE
NodeManager | YARN_NODEMANAGER_HEAPSIZE
WebAppProxy | YARN_PROXYSERVER_HEAPSIZE
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_HEAPSIZE
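  A minimal sketch of how these heap variables might be set: HADOOP_HEAPSIZE would go in conf/hadoop-env.sh and the YARN_* variables in conf/yarn-env.sh. Apart from the default of 1000, the numbers are illustrative assumptions, not recommendations from the original guide.

```shell
# All values are in MB. Only 1000 is the documented default;
# the per-daemon numbers are made up for illustration.
export HADOOP_HEAPSIZE=1000                 # default heap for the HDFS daemons
export YARN_RESOURCEMANAGER_HEAPSIZE=2048   # assumed larger heap for the ResourceManager
export YARN_NODEMANAGER_HEAPSIZE=1024       # assumed heap for each NodeManager
```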
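  Configuring the Hadoop Daemons in Non-Secure Mode
  This section covers the important parameters to specify in the following configuration files: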
  conf/core-site.xml


Parameter | Value | Notes
fs.defaultFS | NameNode URI | hdfs://host:port/
io.file.buffer.size | 131072 | Size of the read/write buffer used in SequenceFiles.
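  For illustration, the two parameters above could be written to core-site.xml as follows. This is only a sketch: CONF_DIR is a scratch directory standing in for your real conf/ directory, and namenode.example.com:8020 is a placeholder NameNode address, not a value from the original guide.

```shell
CONF_DIR="$(mktemp -d)"   # stand-in for your real conf/ directory

# namenode.example.com:8020 is a placeholder NameNode URI.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```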
  conf/hdfs-site.xml
  Configuration for the NameNode:

Parameter | Value | Notes
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
dfs.namenode.hosts / dfs.namenode.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable DataNodes.
dfs.blocksize | 268435456 | HDFS block size of 256 MB for large file systems.
dfs.namenode.handler.count | 100 | More NameNode server threads to handle RPCs from a large number of DataNodes.
  Configuration for the DataNode:

Parameter | Value | Notes
dfs.datanode.data.dir | Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
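  The NameNode and DataNode settings above might be combined into one hdfs-site.xml along these lines. Treat it as a sketch under assumptions: the /data/... directories are hypothetical mount points, and CONF_DIR is a scratch directory standing in for conf/. The block size is computed in the shell to show where 268435456 comes from.

```shell
CONF_DIR="$(mktemp -d)"                  # stand-in for your real conf/ directory
BLOCKSIZE=$((256 * 1024 * 1024))         # 256 MB expressed in bytes

# The /data/N/... paths are hypothetical; list one directory per physical disk.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>${BLOCKSIZE}</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hdfs-site.xml"
```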
  conf/yarn-site.xml
  Configuration for both the ResourceManager and the NodeManager:

Parameter | Value | Notes
yarn.acl.enable | true / false | Enable ACLs? Defaults to false.
yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users, then a space, then comma-separated-groups. Defaults to the special value *, which means anyone. The special value of a single space means no one has access.
yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation.
  Configuration for the ResourceManager:

Parameter | Value | Notes
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources. | host:port
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port
yarn.resourcemanager.webapp.address | ResourceManager web UI host:port. | host:port
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the ResourceManager. | In MB
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the ResourceManager. | In MB
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers.
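  To make the host:port entries concrete, here is a sketch that generates the ResourceManager part of yarn-site.xml. The host name rm.example.com is a placeholder, the ports are YARN's usual defaults, and the allocation limits are illustrative; CONF_DIR is a scratch directory and prop is a helper defined only for this example.

```shell
CONF_DIR="$(mktemp -d)"     # stand-in for your real conf/ directory
RM_HOST="rm.example.com"    # placeholder ResourceManager host

# Helper for this sketch only: print one <property> element.
prop() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

{
  echo '<?xml version="1.0"?>'
  echo '<configuration>'
  prop yarn.resourcemanager.address                  "$RM_HOST:8032"
  prop yarn.resourcemanager.scheduler.address        "$RM_HOST:8030"
  prop yarn.resourcemanager.resource-tracker.address "$RM_HOST:8031"
  prop yarn.resourcemanager.admin.address            "$RM_HOST:8033"
  prop yarn.resourcemanager.webapp.address           "$RM_HOST:8088"
  prop yarn.resourcemanager.scheduler.class \
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
  prop yarn.scheduler.minimum-allocation-mb 1024     # illustrative limit, in MB
  prop yarn.scheduler.maximum-allocation-mb 8192     # illustrative limit, in MB
  echo '</configuration>'
} > "$CONF_DIR/yarn-site.xml"

echo "wrote $CONF_DIR/yarn-site.xml"
```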
  Configuration for the NodeManager:

Parameter | Value | Notes
yarn.nodemanager.resource.memory-mb | Resource, i.e. available physical memory, in MB, for a given NodeManager. | Defines the total resources on the NodeManager that are made available to running containers.
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which the virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk I/O.
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk I/O.
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log aggregation is disabled.
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log aggregation is enabled.
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications.
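  A sketch of the NodeManager settings as they might appear in yarn-site.xml. The memory size and the /data/... directories are assumptions for illustration, and 2.1 is used as the ratio because it is YARN's usual default. On a real node these properties live in the same yarn-site.xml as the ResourceManager settings; the scratch CONF_DIR is used only to keep the example self-contained.

```shell
CONF_DIR="$(mktemp -d)"   # stand-in for your real conf/ directory

# 8192 MB and the /data/N/... paths are hypothetical values.
cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/1/yarn/local,/data/2/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/data/1/yarn/logs,/data/2/yarn/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/yarn-site.xml"
```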
  Configuration for the History Server:

Parameter | Value | Notes
yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them. -1 disables deletion. Be careful: if you set this too small, you will spam the NameNode.
yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value, the value is computed as one-tenth of the aggregated log retention time. Be careful: if you set this too small, you will spam the NameNode.
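  The one-tenth rule in the notes above can be illustrated with a little shell arithmetic. This only mirrors the rule as described; it is not Hadoop's own code, and the 7-day retention period is an assumed value.

```shell
retain_seconds=604800   # assumed retention: 7 days of aggregated logs
check_interval=-1       # 0 or negative means "derive it from the retention time"

# Fall back to one-tenth of the retention time, as the notes describe.
if [ "$check_interval" -le 0 ]; then
  check_interval=$((retain_seconds / 10))
fi

echo "effective check interval: ${check_interval}s"
```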
  conf/mapred-site.xml
  Configuration for MapReduce applications:

Parameter | Value | Notes
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN.
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps.
mapreduce.map.java.opts | -Xmx1024M | Larger heap size for child JVMs of maps.
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces.
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap size for child JVMs of reduces.
mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency.
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files.
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
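  The memory rows in the table pair a container limit with a smaller JVM heap (1536 MB with -Xmx1024M, 3072 MB with -Xmx2560M). The sketch below writes those values to mapred-site.xml and warns if a heap would not fit inside its container. The check is an addition of this example, not a Hadoop feature, and CONF_DIR is a scratch directory standing in for conf/.

```shell
CONF_DIR="$(mktemp -d)"   # stand-in for your real conf/ directory

# Container limits and JVM heaps in MB, taken from the table above.
MAP_MB=1536;    MAP_HEAP_MB=1024
REDUCE_MB=3072; REDUCE_HEAP_MB=2560

# Sanity check added for this sketch: the heap must leave headroom in the container.
if [ "$MAP_HEAP_MB" -ge "$MAP_MB" ] || [ "$REDUCE_HEAP_MB" -ge "$REDUCE_MB" ]; then
  echo "warning: JVM heap does not fit inside the container limit" >&2
fi

cat > "$CONF_DIR/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>${MAP_MB}</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx${MAP_HEAP_MB}M</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>${REDUCE_MB}</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx${REDUCE_HEAP_MB}M</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/mapred-site.xml"
```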
  Configuration for the MapReduce JobHistory Server:

Parameter | Value | Notes
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020.
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888.
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs.
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server.
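  Hadoop Rack Awareness
  The HDFS and YARN components are rack-aware.
  The NameNode and the ResourceManager obtain the rack information of each slave node in the cluster by invoking an API.
  The API resolves a DNS name (or IP address) to a rack id.
  The module that implements this API is configurable through topology.node.switch.mapping.impl. The default implementation runs a script or command configured through topology.script.file.name. If topology.script.file.name is not set, the rack id /default-rack is returned for any IP address passed in.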
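  A topology script of the kind referenced by topology.script.file.name could look roughly like this. Hadoop passes one or more addresses as arguments and expects one rack id per address on standard output. The subnets and rack names here are invented for illustration; a real script would encode your own network layout.

```shell
#!/bin/sh
# Hypothetical mapping: these subnets and rack names are made up.
rack_of() {
  case "$1" in
    10.1.1.*) echo "/rack1" ;;
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;   # unknown nodes fall back to the default rack
  esac
}

# Print one rack id per address given on the command line.
for node in "$@"; do
  rack_of "$node"
done
```

  Saved as an executable file, say topology.sh (a hypothetical name), running it as ./topology.sh 10.1.1.5 10.1.2.7 would print /rack1 and /rack2 on separate lines.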
