ELK Cluster Troubleshooting

lkjhgd · Posted 2016-7-7 09:05:12

1. Logstash reports errors when inserting data into Elasticsearch
The error message is as follows:

{:timestamp=>"2016-07-06T00:02:22.289000+0800", :message=>"retrying failed action with response code: 429 ({\"type\"=>\"es_rejected_execution_exception\", \"reason\"=>\"rejected execution of org.elasticsearch.transport.TransportService$4@74bf8f58 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@48d6d348[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 787682]]\"})", :level=>:info}
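The counters inside this message already show what went wrong, so it can help to pull them out programmatically. The following is a minimal sketch, not part of Logstash or Elasticsearch: the function name parse_rejection and the regular expression are my own, written against the message format shown above.

```python
import re

# Matches the executor summary that Elasticsearch embeds in the rejection reason.
REJECTION_RE = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+), completed tasks = (?P<completed>\d+)"
)


def parse_rejection(line):
    """Return the pool name and its counters from a rejection message, or None."""
    match = REJECTION_RE.search(line)
    if match is None:
        return None
    # Keep the pool name as text and convert every counter to an integer.
    return {
        key: (value if key == "pool" else int(value))
        for key, value in match.groupdict().items()
    }


# The "reason" text from the log line above.
sample = (
    "rejected execution of org.elasticsearch.transport.TransportService$4@74bf8f58 "
    "on EsThreadPoolExecutor[bulk, queue capacity = 50, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@48d6d348"
    "[Running, pool size = 32, active threads = 32, queued tasks = 50, "
    "completed tasks = 787682]]"
)

stats = parse_rejection(sample)
# The pool is saturated when every thread is busy and the queue is at capacity.
saturated = (
    stats["active"] == stats["pool_size"] and stats["queued"] >= stats["capacity"]
)
print(stats["pool"], "saturated:", saturated)
```

For the message above, all 32 bulk threads are active and all 50 queue slots are taken, so the node has no room left for another bulk task.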


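The Logstash configuration in use when the error occurred:

input {
  redis {
    host => "xxxxx"
    type => "redis-input"
    data_type => "list"
    key => "logstash"
    threads => 60           # number of input threads
    batch_count => 50000    # events fetched from redis per request
  }
}

output {
  elasticsearch {
    hosts => ["xxxxx"]
    workers => 30           # number of output workers
    flush_size => 50000     # events buffered per bulk request
    idle_flush_time => 5    # seconds before a partial buffer is flushed
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
  }
}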

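Logstash reads data from the redis queue in batches and then writes it to Elasticsearch in batches. Judging from the error message, the initial diagnosis is that writes to Elasticsearch cannot keep up with reads from redis.

A 429 error from Logstash's elasticsearch output plugin means Too Many Requests.
According to the plugin's Retry Policy, it sends events through the Elasticsearch bulk API and retries actions that fail with 429, which is why the log says "retrying failed action".

flush_size defaults to 500. The plugin uses the Elasticsearch bulk index API to improve indexing performance. In Logstash 2.2 and later, this setting defines the maximum size of a bulk request. It can be adjusted to match how much the input reads at a time. Setting it larger than the pipeline's batch size has no effect.

In Logstash 2.1 and earlier, the plugin uses its own internal event buffer, and this setting controls the size of that buffer.
To call the Elasticsearch bulk API efficiently, events are buffered before being flushed to Elasticsearch. flush_size controls how many events are buffered before they are written in one bulk request. When increasing flush_size, also increase Logstash's heap size, which is set through LS_HEAP_SIZE.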

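Understanding this error also requires understanding Elasticsearch's thread pools.
Each Elasticsearch node holds several thread pools to improve performance. Many of them also have a queue, so pending requests can be held in the queue instead of being discarded.
To check the state of each thread pool, run:  curl -XGET 'http://xxxx:9200/_nodes/stats/thread_pool?pretty'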
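The response to that request can be scanned for pools that have rejected work. The sketch below assumes the response layout as I understand it (a top-level "nodes" object, with per-node "thread_pool" entries carrying a "rejected" counter). The function name rejected_pools is my own, and the sample data is hypothetical, not taken from the cluster in this post.

```python
def rejected_pools(stats):
    """Return (node name, pool name, rejected count) for every pool that rejected work."""
    found = []
    for node in stats.get("nodes", {}).values():
        for pool, counters in node.get("thread_pool", {}).items():
            if counters.get("rejected", 0) > 0:
                found.append((node.get("name", "?"), pool, counters["rejected"]))
    # Worst offenders first.
    return sorted(found, key=lambda row: row[2], reverse=True)


# Hypothetical sample shaped like a _nodes/stats/thread_pool response.
sample_stats = {
    "nodes": {
        "id-1": {
            "name": "node-1",
            "thread_pool": {
                "bulk": {"threads": 32, "queue": 50, "active": 32, "rejected": 1200},
                "search": {"threads": 49, "queue": 0, "active": 1, "rejected": 0},
            },
        },
        "id-2": {
            "name": "node-2",
            "thread_pool": {
                "bulk": {"threads": 32, "queue": 12, "active": 30, "rejected": 3},
                "index": {"threads": 32, "queue": 0, "active": 0, "rejected": 0},
            },
        },
    }
}

for node_name, pool_name, count in rejected_pools(sample_stats):
    print(node_name, pool_name, count)
```

In practice the curl output could be saved to a file and loaded with json.load before being passed to the function.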
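The more important thread pools are:
generic
For generic operations such as background node discovery. Thread pool type is cached.
index
For index/delete operations. Thread pool type is fixed; the size is the number of available CPU cores and the queue size is 200.
search
For count/search operations. Thread pool type is fixed; the size is (available CPU cores * 3) / 2 + 1 and the queue size is 1000.
suggest
For suggest operations. Thread pool type is fixed; the size is the number of available CPU cores and the queue size is 1000.
get
For get operations. Thread pool type is fixed; the size is the number of available CPU cores and the queue size is 1000.
bulk
For bulk operations. Thread pool type is fixed; the size is the number of available CPU cores and the queue size is 50.
percolate
For percolate operations. Thread pool type is fixed; the size is the number of available CPU cores and the queue size is 1000.
snapshot
For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m; the size is min(5, available CPU cores / 2). With 4 cores the pool size is 4/2 = 2; with 32 cores it is 5.
warmer
For segment warm-up operations. Thread pool type is scaling with a keep-alive of 5m; the size is min(5, available CPU cores / 2). With 4 cores the pool size is 4/2 = 2; with 32 cores it is 5.
refresh
For refresh operations. Thread pool type is scaling with a keep-alive of 5m; the size is min(10, available CPU cores / 2). With 4 cores the pool size is 4/2 = 2; with 32 cores it is min(10, 16) = 10.
listener
Mainly used for Java client operations when listener threaded is set to true. Thread pool type is scaling with a keep-alive of 5m; the size is min(10, available CPU cores / 2). With 4 cores the pool size is 4/2 = 2; with 32 cores it is min(10, 16) = 10.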
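The default sizes listed above can be worked out for any core count. This is a small sketch of those formulas only. The function name default_thread_pools is my own, and the integer rounding for odd core counts and the floor of 1 are assumptions, not something the list states.

```python
def default_thread_pools(processors):
    """Compute default pool sizes from the formulas listed above."""
    # Assumption: integer division, and never fewer than one thread.
    half = max(1, processors // 2)
    return {
        "index": {"type": "fixed", "size": processors, "queue_size": 200},
        "search": {"type": "fixed", "size": processors * 3 // 2 + 1, "queue_size": 1000},
        "suggest": {"type": "fixed", "size": processors, "queue_size": 1000},
        "get": {"type": "fixed", "size": processors, "queue_size": 1000},
        "bulk": {"type": "fixed", "size": processors, "queue_size": 50},
        "percolate": {"type": "fixed", "size": processors, "queue_size": 1000},
        "snapshot": {"type": "scaling", "size": min(5, half)},
        "warmer": {"type": "scaling", "size": min(5, half)},
        "refresh": {"type": "scaling", "size": min(10, half)},
        "listener": {"type": "scaling", "size": min(10, half)},
    }


for cores in (4, 32):
    pools = default_thread_pools(cores)
    print(cores, "cores:", {name: spec["size"] for name, spec in pools.items()})
```

For 32 cores this gives a bulk pool of 32 threads with a queue of 50, which matches the pool size and queue capacity reported in the error message.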

To change the defaults of a thread pool, configure it like this:


threadpool:
    index:
        size: 30

Or write it on one line as threadpool.index.size: 30

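Thread pool types
cached
The cached thread pool is an unbounded thread pool that spawns a thread whenever there are pending requests. It exists to keep requests submitted to the pool from blocking or being rejected. Unused threads in this pool are terminated after the keep-alive expires (5m by default). The cached type is reserved for the generic thread pool.
The keep_alive parameter determines how long a thread stays in the pool when it has nothing to do.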


threadpool:
    generic:
        keep_alive: 2m

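fixed
The fixed thread pool holds a fixed number of threads, usually with a queue that temporarily stores requests that cannot be handled right away.
The size parameter controls the number of threads.
The queue_size parameter controls how many pending requests the queue can hold; once the queue is full, new requests are rejected.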


threadpool:
    index:
        size: 30
        queue_size: 1000
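To make the behaviour of a fixed pool concrete, here is a toy model in Python. It is only an illustration of "fixed number of threads plus a bounded queue" and not how Elasticsearch implements its executor; the class FixedPoolModel and its methods are hypothetical names. The numbers are the bulk pool's from the error message (32 threads, queue capacity 50).

```python
from collections import deque


class FixedPoolModel:
    """Toy model of a fixed pool: `size` workers plus a bounded queue."""

    def __init__(self, size, queue_size):
        self.size = size
        self.queue_size = queue_size
        self.active = 0
        self.queue = deque()
        self.rejected = 0

    def submit(self, task):
        """Return True if the task is accepted, False if it is rejected."""
        if self.active < self.size:
            self.active += 1          # a free worker takes the task
            return True
        if len(self.queue) < self.queue_size:
            self.queue.append(task)   # all workers busy: wait in the queue
            return True
        self.rejected += 1            # workers busy and queue full
        return False

    def finish_one(self):
        """One worker finishes and picks up the next queued task, if any."""
        if self.queue:
            self.queue.popleft()
        elif self.active > 0:
            self.active -= 1


pool = FixedPoolModel(size=32, queue_size=50)
results = [pool.submit(n) for n in range(83)]
print(results.count(True), pool.rejected)  # prints: 82 1
```

With 32 workers and 50 queue slots the model accepts 82 tasks, and the 83rd is rejected until a worker finishes.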

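scaling
The scaling thread pool holds a dynamic number of threads. Depending on load, the thread count varies between 1 and size.
The keep_alive parameter determines how long a thread stays in the pool when it has nothing to do.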


threadpool:
    warmer:
        size: 8
        keep_alive: 2m

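The number of CPU cores is detected automatically. If the detected number is wrong, set it explicitly with the processors setting.

Logstash reports this error because the bulk operations queue is full. Either lower flush_size or enlarge Elasticsearch's bulk thread pool queue.

To increase the queue size of Elasticsearch's bulk thread pool:
threadpool.bulk.queue_size: 1000

After lowering the redis input's batch_count to 5000, Logstash no longer reported the error. Problem solved.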
