Terminology
SPOF (single point of failure)
NN (NameNode)
Hadoop 2.0 HA
The HDFS Name Node is primarily responsible for serving two types of file system metadata: file system namespace information and block locations. Because of the architecture of HDFS, these must be handled separately.
For the related concepts of namespace information and block locations, see:
http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/
Background
Prior to Hadoop 0.23.2, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
- In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
- Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
Architecture
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
Note 1: To provide fast failover, the Standby NameNode must have up-to-date information about block locations in the cluster. To achieve this, all DataNodes managed by the two NameNodes send block reports and heartbeats to both of them.
Note 2: A NameNode runs in one of two modes, Active or Standby. The Active NameNode accepts and processes client RPC requests, writes both its own edit log and the edit log on shared storage, and receives block reports, block location updates, and heartbeats from the DataNodes. The Standby NameNode receives the same block reports, block location updates, and heartbeats from the DataNodes, and additionally reads and replays the edit log from shared storage, so that its metadata (namespace information plus the block locations map) stays synchronized with the Active NameNode's. The Standby is therefore a hot standby: once switched to Active mode, it can provide NameNode service immediately.
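To make the setup above concrete, here is a minimal sketch of the HA-related configuration keys, expressed through Hadoop's Configuration API rather than hdfs-site.xml. The key names follow the official HA documentation; the nameservice name mycluster, the hostnames, and the NFS path are illustrative assumptions, not values from any real cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class HaConfSketch {
  public static Configuration haConf() {
    Configuration conf = new Configuration();
    // One logical nameservice backed by two physical NameNodes.
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
    // Shared directory (e.g. an NFS mount) from which the Standby tails edits.
    conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/nfs/hadoop/ha-edits");
    return conf;
  }
}
```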
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.
P.S.: Is this fencing analogous to a memory barrier??
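As a hedged example, fencing is configured through the dfs.ha.fencing.methods key: a newline-separated list of methods tried in order. The sshfence method and the shell(...) fallback are the ones documented for Hadoop HA; the script path and private-key location below are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class FencingConfSketch {
  public static void addFencing(Configuration conf) {
    // Try an SSH-based fence first (kill the old Active's NN process); if SSH
    // fails, fall back to a site-specific script, e.g. one that cuts power.
    conf.set("dfs.ha.fencing.methods",
        "sshfence\nshell(/path/to/my/fence-script.sh)");
    // Key used by sshfence to log in to the old Active's machine.
    conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
  }
}
```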
Note: Currently, only manual failover is supported. This means the HA NameNodes are incapable of automatically detecting a failure of the Active NameNode, and instead rely on the operator to manually initiate a failover. Automatic failure detection and initiation of a failover will be implemented in future versions.
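For illustration, a manual failover is triggered with the hdfs haadmin -failover command; the sketch below drives the same DFSHAAdmin tool programmatically. The logical IDs nn1 and nn2 are assumed to match the entries in dfs.ha.namenodes.<nameservice> in your configuration.

```java
import org.apache.hadoop.hdfs.tools.DFSHAAdmin;
import org.apache.hadoop.util.ToolRunner;

public class ManualFailoverSketch {
  public static void main(String[] args) throws Exception {
    // Programmatic equivalent of the CLI: hdfs haadmin -failover nn1 nn2
    // ("nn1"/"nn2" are the logical NameNode IDs, assumed from the HA config).
    int rc = ToolRunner.run(new DFSHAAdmin(),
        new String[] {"-failover", "nn1", "nn2"});
    System.exit(rc);
  }
}
```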
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
Future Work
Other options to share NN metadata
1. BookKeeper
2. Multiple, potentially non-HA filers
3. An entirely different metadata system
More advanced client failover/load shedding
1. Serve stale reads from the Standby NN
2. Speculative RPC
3. Non-RPC clients (IP failover, DNS failover, proxy, etc.)
4. Even higher HA: multiple standby NNs
References:
Official documentation
http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html
Design document: https://issues.apache.org/jira/secure/attachment/12480489/NameNode%20HA_v2_1.pdf
Comparative analysis of approaches
http://www.iyunv.com/Linux/2012-06/63565.htm
Cloudera documentation
http://www.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/
Source code
http://svn.apache.org/repos/asf/hadoop/common/branches/HDFS-1623/
Source-code analysis
http://yanbohappy.sinaapp.com/?p=50
Slides
http://www.slideshare.net/hortonworks/nn-ha-hadoop-worldfinal
FAQ:
Shared Storage vs. Shared-nothing Storage for NN Metadata
The Active and Standby can either share storage (say NFS), or the Active can stream edits to the Standby (as is done for the BackupNode from release 0.21 onwards). Some of the considerations are:
- The shared storage server becomes the single point of failure and needs to be highly available. BookKeeper is a good solution here but is not yet ready for prime time, though it is perhaps the longer-term solution. With BookKeeper, the NN need not maintain its state on local disks, making NNs completely "stateless". Some organizations already have an HA NFS server in their cluster for other reasons.
- The BackupNode is less expensive, because it does not require a shared storage server. However, it does not support use case 3f (handover from Active to Standby must always be done).
- The BackupNode does not require fencing of storage, as long as shared storage is not used to solve use case 3f (??). Shared storage requires fencing. However, if STONITH is used to solve any of the fencing issues, it solves all fencing needs.
- There is a lack of symmetry with the BackupNode in that the BackupNode cannot take over unless it is fully sync'ed with the Active.
- When the BackupNode is down, one still has to rely on remote storage for storing state external to the Active, which brings us back to the shared-storage case.
Note: BookKeeper is a highly available write-ahead logging system. Work has already been done to allow the HDFS NameNode to write its edit log to BookKeeper, though this has not yet been tested with the HA NameNode.
Failover control outside the NN using a FailoverController (watchdog)
Our approach is to use a FailoverController daemon that is separate from the actual NN. The FailoverController daemon is very similar to the Resource Manager in Linux-HA. For a Linux-HA-based solution, the RM that comes as part of it could be used directly. For ZooKeeper we could write our own, or configure the Linux-HA Resource Manager to use ZooKeeper. The FailoverController performs the following functions:
1. Monitors the health of the NN, the OS and the HW, and other resources such as connectivity to the network.
2. Performs heartbeats so that a leader can be elected (could heartbeat to ZooKeeper and let ZooKeeper elect the leader).
3. On leader election an Active is selected. The active FailoverController instructs its NN to go from Standby to Active. (Note each NN starts as Standby; it becomes Active only when instructed by the FailoverController.)
Having a separate FailoverController daemon has several advantages (it reduces coupling):
- Building this functionality into the NN would make the heartbeat mechanism susceptible to GC failures.
- The FailoverController should be tightly written code, separate from the application for which fault tolerance is being enabled.
- It makes leader election pluggable.
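The following sketch is purely illustrative: it compresses the three duties listed above into a single watchdog loop. None of these interfaces is a real Hadoop API; NameNodeHandle and Elector are hypothetical stand-ins for the local NN control surface and a pluggable leader-election backend (e.g. ZooKeeper), which is exactly the pluggability the last bullet argues for.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative-only watchdog; the real controller (later the ZKFC) differs. */
public class FailoverControllerSketch {
  interface NameNodeHandle {           // hypothetical local-NN control surface
    boolean isHealthy();               // duty 1: NN process, OS, HW, network
    void transitionToActive();
    void transitionToStandby();
  }
  interface Elector {                  // hypothetical leader-election plug-in
    void heartbeat();                  // duty 2: keep our candidacy alive
    boolean weAreLeader();             // duty 3: did the election pick us?
  }

  public static void run(NameNodeHandle nn, Elector elector)
      throws InterruptedException {
    boolean active = false;            // every NN starts as Standby
    while (true) {
      if (nn.isHealthy()) {
        elector.heartbeat();
        if (elector.weAreLeader() && !active) {
          nn.transitionToActive();     // only the controller promotes the NN
          active = true;
        }
      } else if (active) {
        nn.transitionToStandby();      // cede leadership on local failure
        active = false;
      }
      TimeUnit.SECONDS.sleep(1);
    }
  }
}
```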
Client redirection after failover
Clients talk to the NameNode through an RPC proxy. On the client side two proxies exist at the same time, one for the connection to the Active NameNode and one for the Standby. Because the client has a retry mechanism, when the proxy that was talking normally to the Active NameNode receives a StandbyException from an RPC, it means that NameNode has switched to Standby mode; this triggers the class named by the dfs.client.failover.proxy.provider.[nameservice ID] parameter to perform the failover. Currently the only implementation is ConfiguredFailoverProxyProvider, which simply sends the next RPC, and all subsequent RPCs, to the other NameNode; in effect the client follows the Active/Standby switch.
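A minimal sketch of the client-side setup this paragraph describes, assuming a nameservice named mycluster (an assumption): the client is pointed at the logical URI rather than a host, and the failover proxy provider named above is wired in. The fully qualified class name follows the HDFS HA documentation.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientSketch {
  public static FileSystem connect() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://mycluster");   // logical URI, not a host
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    // On a StandbyException, the retry policy asks the provider for the
    // other NameNode's proxy and re-issues the RPC there.
    return FileSystem.get(URI.create("hdfs://mycluster"), conf);
  }
}
```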
HDFS Federation
Traditional HDFS has a master/slave structure in which the master (the NameNode) stores all of the file system's metadata, and every file operation requires multiple accesses to the NameNode, so the NameNode becomes the main scalability bottleneck. HDFS Federation was introduced to solve this: it allows multiple NameNodes in one HDFS cluster, each managing a portion of the directory tree, while the DataNodes are unchanged and serve all of them. In effect, a single central authority is replaced by several autonomous ones, which shrinks the blast radius of a failure and provides a degree of isolation. For details, see the links below and the configuration sketch that follows them:
http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html
http://dongxicheng.org/mapreduce-nextgen/nextgen-mapreduce-introduction/
http://yanbohappy.sinaapp.com/?p=32
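For illustration, a minimal sketch of a federated (non-HA) setup with two nameservices. The nameservice names and hosts are invented for the example, but the key pattern dfs.namenode.rpc-address.<nameservice> follows the federation configuration documented at the links above.

```java
import org.apache.hadoop.conf.Configuration;

public class FederationConfSketch {
  public static Configuration federatedConf() {
    Configuration conf = new Configuration();
    // Two independent namespaces served by two NameNodes; every DataNode
    // registers with both, but each NN manages only its own subtree.
    conf.set("dfs.nameservices", "ns-users,ns-logs");   // names are illustrative
    conf.set("dfs.namenode.rpc-address.ns-users", "nn-a.example.com:8020");
    conf.set("dfs.namenode.rpc-address.ns-logs", "nn-b.example.com:8020");
    return conf;
  }
}
```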
Appendix: HA approaches
What is high availability?
A high-availability cluster is a server-cluster technique whose goal is to minimize service downtime.
High availability (HA) means improving the availability of systems and applications by minimizing the downtime caused by routine (planned) maintenance operations and by sudden (unplanned) system crashes.
Functions of high availability (HA):
1. Detection and removal of software faults
2. Backup and data protection
3. A management station that can monitor the operation of every site, report system status at any time or on a schedule, report and alert on failures promptly, and provide the necessary controls
4. Fault isolation, and service switchover between the primary and backup servers
Background analysis:
HDFS high availability (HA) is a weak point of Hadoop. Both HDFS and MapReduce use a single-master design in which every other machine in the cluster talks to one central machine; if that central machine dies, the cluster simply stops working (data is not necessarily lost, but at a minimum a restart and similar recovery work is needed), which lowers availability. This is commonly called a single point of failure (SPOF). The rest of this article examines the solutions to this problem by comparing the changes across Hadoop versions.
Problem statement:
Eliminate the cluster unavailability caused by a NameNode crash, strengthening HDFS high availability.
Approaches:
Before Hadoop 0.23.1, when the server hosting the NameNode went down, a new NameNode could be rebuilt from backed-up metadata and put into service.
Scheme 1:
Hadoop itself provides a way to restore the NameNode's metadata from the SecondaryNameNode's backup. However, because of how checkpointing works (the SecondaryNameNode merges and synchronizes the NameNode's data only at each checkpoint), the SecondaryNameNode's backup is not kept in sync with the NameNode at all times: if the NameNode crashes, the edits made since the last checkpoint are lost, and the size of that window is determined by the checkpoint period. The period can be shortened to reduce the amount of lost data (see the configuration sketch below), but every checkpoint is expensive, and this approach cannot fundamentally eliminate data loss.
Drawback: the SecondaryNameNode's backup cannot be kept continuously in sync with the NameNode, so data loss cannot be fundamentally prevented.
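A hedged sketch of shrinking the checkpoint window discussed in Scheme 1. The key fs.checkpoint.period is the pre-0.23 name (Hadoop 2.x renamed it to dfs.namenode.checkpoint.period); the 300-second value is only an example.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriodSketch {
  public static void shortenCheckpointInterval(Configuration conf) {
    // Shrinking the window reduces, but never eliminates, the edits that can
    // be lost between the last SecondaryNameNode checkpoint and an NN crash.
    conf.set("fs.checkpoint.period", "300");   // seconds; default is 3600
  }
}
```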
Scheme 2:
Hadoop's other option is NFS (a network file system), a way to back up the NameNode's metadata in real time: configure multiple data directories (including an NFS directory) so that the NameNode writes its persisted metadata to all of them simultaneously. Compared with Scheme 1, this avoids data loss (setting aside the very small possibility that NFS itself loses data). A configuration sketch appears after the drawbacks below.
Drawbacks:
1. NameNode IP mapping and access: rebuilding the NameNode may leave clients pointing at a different IP, although the replacement NameNode can be configured with the original NameNode's IP when it is brought into service.
2. An NFS server outage paralyzes the whole cluster; an NFS cluster can be configured to keep NFS itself available.
3. The delay of rebuilding the NameNode: there is no guarantee the replacement can take over the moment a failure occurs. For systems that must stay online, a hot-standby NameNode is recommended instead. This is the critical issue: there will be an interruption, which is unacceptable for a high-availability cluster.
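A hedged sketch of Scheme 2's multiple metadata directories: the NameNode persists its metadata synchronously to every directory in the list, so placing one of them on an NFS mount yields an up-to-date off-machine copy. The paths are illustrative; dfs.name.dir is the pre-0.23 key (Hadoop 2.x uses dfs.namenode.name.dir).

```java
import org.apache.hadoop.conf.Configuration;

public class NameDirSketch {
  public static void mirrorMetadata(Configuration conf) {
    // The NameNode writes its image and edits to every listed directory;
    // the second entry is assumed to be an NFS mount on another machine.
    conf.set("dfs.name.dir",
        "/data/1/dfs/nn,/mnt/nfs/dfs/nn");   // paths are illustrative
  }
}
```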
Scheme 3:
Build a virtual cluster with VMware and keep a mirrored backup of the NameNode; if the NameNode dies, swap in the mirrored machine immediately, raising availability.
Drawback: this is not a solution built into Hadoop, and the VMware stack is not cheap.
Note: I am not familiar with this third scheme; I only know it exists and have not tried it. Source: http://www.iyunv.com/Linux/2012-06/63563.htm
The solution provided in Hadoop 2.0.0-alpha
Scheme 4:
Configure two NameNodes in the same cluster, one of which is active; when it crashes, service can fail over quickly to the redundant NameNode.
In a typical high-availability cluster, the two NameNodes are configured on two separate machines. At any moment exactly one of them is in the Active state and the other is in Standby. The Active NameNode handles all client requests in the cluster, while the Standby continuously mirrors the Active's data so that it can provide a fast failover if the Active NameNode crashes.
The current Hadoop 2.0.0 release provides a mechanism for keeping the Standby NameNode's data synchronized with the Active's; this mechanism requires that both NameNodes can access a directory on a shared storage device (for example, an NFS export from a NAS). This restriction is very likely to be relaxed in future versions.
When the Active NameNode modifies the namespace, it durably writes a record of the modification to an edit log file in the shared directory. The Standby NameNode constantly watches that edit log file and, whenever it changes, applies the new records to its own namespace. When the Active NameNode crashes, the Standby takes over as the new Active, but before doing so it makes sure it has read every record in the edit log from the shared directory; this guarantees that the two NameNodes' namespaces are fully synchronized before the failover.
To provide fast failover, the Standby NameNode must also have up-to-date block location information for the cluster; to achieve this, all DataNodes managed by the two NameNodes send block reports and heartbeats to both of them.
For the correct operation of a high-availability cluster it is essential that only one of the two NameNodes be Active at any time; otherwise the namespace would become inconsistent, leading to data loss or other unpredictable results. To guarantee this and prevent the "split-brain" scenario, the administrator must configure at least one fencing method for the shared storage. During a failover, if it cannot be verified that the previously Active NameNode has given up its Active state, the configured fencing method is responsible for cutting off that node's access to the shared edit log. This prevents the old Active from making any further modifications to the namespace and allows the new Active to complete the failover safely.
Note: currently only manual failover is implemented, which means the HA NameNodes cannot detect a failure of the Active NameNode on their own and must rely on an operator to initiate the failover manually. Automatic failure detection and failover initiation will be implemented in future versions.
Scheme 4 (continued): required hardware
To deploy an HA cluster, prepare the following:
NameNode machines: the machines running the Active and Standby NameNodes should have equivalent hardware.
Shared storage: a shared directory that both NameNodes can read and write. Typically this is a remote filer that supports NFS and is mounted on each NameNode machine. Currently only a single shared edits directory is supported, so the availability of the system is limited by the availability of that directory; removing this remaining SPOF would require redundancy for the shared edits directory itself.
Scheme 4 is what Hadoop 2.0.0-alpha delivers, but at present only manual failover is implemented: the HA NameNodes cannot detect a failure of the Active NameNode automatically and rely on an operator to initiate the failover manually. Automatic failure detection and failover will be implemented in future versions.