整体上虽然数据没有丢失,但是仍然给线上系统带来一定的影响。分析了整个过程之后,预防这类问题出现应该
从如下几点入手:
1、相关参数
<property>
<name>hbase.regionserver.lease.period</name>
<value>240000</value>
<description>HRegion server lease period in milliseconds. Default is 60 seconds. Clients must report in within this period else they are considered dead.</description>
</property>
hbase.regionserver.lease.period的设置是Tradeoff,因为在HBase Client操作过程中,如果RegionServer的响应速度过慢或者出现了Long FullGC,超过了设置值,则抛出lease expired。
ps:hbase.regionserver.lease.period的设置要和hbase.rpc.timeout保持一致,最好hbase.rpc.timeout的值略大于hbase.regionserver.lease.period
<property>
<name>zookeeper.session.timeout</name>
<value>90000</value>
<description>Default 3 minutes.ZooKeeper session timeout. HBase passes this to the zk quorum as suggested maximum time for a session (This setting becomes zookeeper's 'maxSessionTimeout'). See http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions "The client sends a requested timeout, the server responds with the timeout that it can give the client. " In milliseconds. </description>
</property>
zookeeper.session.timeout指定zk Client与zk Server Quorum之间的会话的最长时间,RegionServerTracker监听rs的znode的变化,超过session time out的值,就认定该RegionServer出错了,然后通过ServerManager的expireServer将其加入Dead RegionServer Lists中。这样等RegionServer发送心跳给HMaster时,发现该RegionServer位于Dead RegionServer Lists中,会抛出YouAreDeadException。然后,RegionServer会开始执行shutdown系统操作。
注意:这里设置的zookeeper.session.timeout与zookeeper设置的ticktime共同影响session的生存周期,一般而言,sessionTimeOut= Min(应用设置的zookeeper.session.timeout,ticktime*20)