vlei posted on 2017-5-21 10:17:03

Investigating an abnormal master departure in ElasticSearch

The cluster has four nodes, with es4 as the master. At some point es1 suddenly dropped out of the cluster. The investigation went as follows:

$ date; ssh bd4 date
2012年 09月 03日星期一 09:41:26 CST
2012年 09月 03日星期一 09:41:00 CST

es4's clock is 26 seconds behind es1's. The log times below have been converted to es1's clock.
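The conversion above can be sketched in a few lines of Python. This is only an illustration: the function name `es4_to_es1` is made up here, and the sketch ignores the small delay added by the ssh round trip.

```python
from datetime import datetime, timedelta

# Offset measured with `date; ssh bd4 date`: es4's clock reads 26 s behind es1's.
ES4_BEHIND_ES1 = timedelta(seconds=26)

def es4_to_es1(ts: datetime) -> datetime:
    """Convert a timestamp taken from es4's log to es1's clock."""
    return ts + ES4_BEHIND_ES1

local = datetime(2012, 9, 3, 9, 41, 26)   # es1, from the output above
remote = datetime(2012, 9, 3, 9, 41, 0)   # es4, from the output above

print(local - remote)        # 0:00:26
print(es4_to_es1(remote))    # 2012-09-03 09:41:26
```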

In es4's log:

Quote:
removed {],}, reason: zen-disco-node_failed(]), reason failed to ping, tried times, each with maximum timeout

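By this point es4 had already retried 3 times with a 30 s timeout each, so es1 must have run into trouble 90 s earlier, that is, at about 42.22. What was happening on es1 during that window?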
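The back-calculation is simple enough to write down. The sketch below assumes that the retries run back to back and that "42.22" means 42 minutes 22 seconds. The removal timestamp is a placeholder chosen for illustration, since the quoted log line carries no timestamp.

```python
from datetime import datetime, timedelta

PING_RETRIES = 3                       # retries es4 made, per the text
PING_TIMEOUT = timedelta(seconds=30)   # timeout per retry, per the text

def detection_window(retries=PING_RETRIES, timeout=PING_TIMEOUT):
    """Total time the master waits before declaring a node failed."""
    return retries * timeout

def earliest_failure(removed_at):
    """Latest moment at which the node must already have been unresponsive."""
    return removed_at - detection_window()

# Placeholder (hypothetical) removal time; only minutes and seconds matter here.
removed_at = datetime(2000, 1, 1, 0, 43, 52)

print(detection_window())             # 0:01:30
print(earliest_failure(removed_at))   # 2000-01-01 00:42:22
```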

Quote:
duration , collections /, total /, memory ->/, all_pools { ->/}{ ->/}{ ->/}{ ->/}{ ->/}
duration , collections /, total /, memory ->/, all_pools { ->/}{ ->/}{ ->/}{ ->/}{ ->/}
duration , collections /, total /, memory ->/, all_pools { ->/}{ ->/}{ ->/}{ ->/}{ ->/}
duration , collections /, total /, memory ->/, all_pools { ->/}{ ->/}{ ->/}{ ->/}{ ->/}
Exception caught on netty layer []
java.io.IOException: 断开的管道 (Broken pipe)

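During this window there were GC pauses of 5.9 s, 5.8 s, 8.9 s and 1.4 min. The last one in particular lasted 1.4 minutes, which is close to 90 s. Most likely this GC left es1 unresponsive, and that is why it was kicked out of the cluster.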
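To compare the pauses with the detection window, the durations can be normalized to seconds. This is a rough sketch: the helper `to_seconds` and its unit table are illustrative, not part of ElasticSearch, and the 30 s and 90 s figures are the timeout and total window quoted earlier.

```python
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}

def to_seconds(text):
    """Convert a duration such as '5.9s' or '1.4m' to seconds."""
    for unit in ("ms", "s", "m", "h"):   # test "ms" before "s" and "m"
        if text.endswith(unit):
            return float(text[:-len(unit)]) * UNIT_SECONDS[unit]
    raise ValueError("unknown duration: " + text)

PING_TIMEOUT_S = 30   # single ping timeout, per the text
WINDOW_S = 90         # 3 retries x 30 s, per the text

pauses = ["5.9s", "5.8s", "8.9s", "1.4m"]   # GC durations reported above
for p in pauses:
    secs = to_seconds(p)
    note = "exceeds one ping timeout" if secs > PING_TIMEOUT_S else "harmless"
    print(f"{p:>5} = {secs:5.1f} s, {secs / WINDOW_S:4.0%} of the window, {note}")
```

Only the 1.4 min pause is longer than a single ping timeout, and it covers most of the 90 s window on its own.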

Interestingly, once es1 noticed that the master es4 was gone, it elected es3 as the new master, but es3 then failed right away as well. The log:

Quote:
master_left []], reason
master {new ], previous ]}, removed {],}, reason: zen-disco-master_failed (])
master_left []], reason
master {new ], previous ]}, removed {],}, reason: zen-disco-master_failed (])

Let's see what was happening on es3 at that moment:
$ date; ssh bd3 date
2012年 09月 03日星期一 09:51:14 CST
2012年 09月 03日星期一 09:51:11 CST

The two clocks differ by only 3 seconds. es3's log:

Quote:
removed {],}, reason: zen-disco-receive(from master []])
, node, , s: Failed to execute
org.elasticsearch.transport.RemoteTransportException: ]

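Strange: around 04:44:06 nothing happened on es3 at all. It looks as though es3 simply ignored es1, so es1 had no choice but to drop es3 as well and make itself a standalone master.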

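The symptoms are clear now, so how do we fix this? Two ideas:
1. Shorten the GC pauses as far as possible, even if that means running GC more often, so that no single pause lasts too long.
2. Change the zen configuration and raise both the fault detection timeout and the retry count.
The first option is a lot of work and the second is simpler, so as a first try I will set retries to 6.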
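A quick check of what the change buys, assuming the per-ping timeout stays at 30 s and the tolerated pause is roughly retries times timeout. In zen discovery these values are, to my understanding, controlled by `discovery.zen.fd.ping_retries` and `discovery.zen.fd.ping_timeout`. Please check the documentation for your version before relying on those names.

```python
def tolerated_pause_s(ping_retries, ping_timeout_s=30):
    """Approximate pause a node can suffer before the master drops it."""
    return ping_retries * ping_timeout_s

GC_PAUSE_S = 84   # the 1.4 min pause reported above, about 84 s

for retries in (3, 6):
    window = tolerated_pause_s(retries)
    margin = window - GC_PAUSE_S
    print(f"retries={retries}: window {window} s, margin over the GC pause {margin} s")
```

With 3 retries the margin is only a few seconds, so a slightly longer pause or some network delay is enough to trip the detection. With 6 retries the same pause leaves about a minute and a half to spare, at the cost of noticing a node that is really dead more slowly.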
