Troubleshooting an ES cluster in yellow status

Background:

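A full-text search endpoint in our project was taking more than 30 s to respond. Tracing the endpoint logic showed that most of the time was spent in the ES query, so the investigation moved to the ES cluster. Running the DSL generated by the request directly in Kibana confirmed that the query itself was slow, which led to a check of the cluster's health.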

Analyzing the cluster with the ES APIs gave the following findings:

1. The cluster health status is yellow, with a large number of unassigned replica shards:

{
  "cluster_name" : "cdb*",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : ***,
  "number_of_data_nodes" : ***,
  "active_primary_shards" : ***,
  "active_shards" : ***,
  "relocating_shards" : ***,
  "initializing_shards" : ***,
  "unassigned_shards" : 214, // note this value
  "delayed_unassigned_shards" : ***,
  "number_of_pending_tasks" : ***,
  "number_of_in_flight_fetch" : ***,
  "task_max_waiting_in_queue_millis" : ***
}
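
The check performed by eye on the response above can be sketched in Python. This is a minimal illustration rather than part of the original investigation: the function name `health_warnings` is hypothetical, and the sample dict contains only the fields whose values are visible in the output above.

```python
def health_warnings(health: dict) -> list[str]:
    """Return human-readable warnings for a _cluster/health response."""
    warnings = []
    status = health.get("status")
    if status != "green":
        warnings.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        warnings.append(f"{unassigned} shard(s) unassigned")
    if health.get("timed_out"):
        warnings.append("health request timed out")
    return warnings


# Fields copied from the response above; redacted values are left out.
sample = {"status": "yellow", "timed_out": False, "unassigned_shards": 214}
for line in health_warnings(sample):
    print(line)
```

A green cluster with no unassigned shards yields an empty list, so the function can serve as a simple alert condition.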

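2. One node became unreachable for an unknown reason, which triggered shard recovery in the cluster: (1) all lost replica shards are reallocated to the remaining healthy nodes, and (2) the cluster rebalances.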

{
 "unassigned_info": {
  "reason": "NODE_LEFT",
  "at": "2020-11-20T03:12:16",
  "details": "node_left ***",
  "last_allocation_status": "no_attempt"
 }
}

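3. The shard recovery concurrency limits (per-node incoming and outgoing) were left at their defaults, so every recovery slot was in use and recovery proceeded too slowly.

(cluster.routing.allocation.node_concurrent_incoming_recoveries=2, cluster.routing.allocation.node_concurrent_outgoing_recoveries=2)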
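
The write-up stops at the diagnosis and does not say which change was applied. As a sketch of one possible follow-up, the snippet below builds a request body for `PUT _cluster/settings` that raises `cluster.routing.allocation.node_concurrent_recoveries`, the combined setting that covers both the incoming and outgoing limits. The helper name `recovery_settings_body` and the value 4 are assumptions for illustration only; a suitable value depends on the disk and network headroom of the nodes.

```python
import json

# Combined setting that applies to both incoming and outgoing recoveries.
SETTING = "cluster.routing.allocation.node_concurrent_recoveries"


def recovery_settings_body(limit: int, scope: str = "persistent") -> str:
    """Build the JSON body for PUT _cluster/settings (hypothetical helper)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if scope not in ("persistent", "transient"):
        raise ValueError("scope must be 'persistent' or 'transient'")
    return json.dumps({scope: {SETTING: limit}}, indent=2)


# 4 is an illustrative value, not a recommendation from the source.
print(recovery_settings_body(4))
```

The printed JSON can be pasted under `PUT _cluster/settings` in the Kibana console. Raising the limit increases recovery I/O on each node, so it is usually worth reverting once the cluster is green again.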

Problem details (per-node decisions from GET _cluster/allocation/explain):
{
 "node_id": "***",
 "node_name": "mastersha",
 "transport_address": "***",
 "node_decision": "throttled",
 "deciders": [{
  "decider": "throttling",
  "decision": "THROTTLE",
  "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
 }]
}

{
 "node_id": "***",
 "node_name": "master",
 "transport_address": "***",
 "node_decision": "no",
 "store": {
  "matching_sync_id": true
 },
 "deciders": [{
   "decider": "same_shard",
   "decision": "NO",
   "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index_execution][2], node[***], [P], s[STARTED], a[id=***]]"
  },
  {
   "decider": "throttling",
   "decision": "THROTTLE",
   "explanation": "reached the limit of outgoing shard recoveries [2] on the node [***] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
  }
 ]
}
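
With many nodes, reading each decision by hand gets tedious. The sketch below tallies which deciders are blocking allocation, assuming the per-node entries have the shape shown above. The function name `count_decider_verdicts` is hypothetical, and the sample data keeps only the fields needed for the tally.

```python
from collections import Counter


def count_decider_verdicts(node_decisions: list[dict]) -> Counter:
    """Count (decider, decision) pairs across all node decisions."""
    counts = Counter()
    for node in node_decisions:
        for decider in node.get("deciders", []):
            counts[(decider["decider"], decider["decision"])] += 1
    return counts


# Trimmed copies of the two node decisions shown above.
nodes = [
    {"node_name": "mastersha", "node_decision": "throttled",
     "deciders": [{"decider": "throttling", "decision": "THROTTLE"}]},
    {"node_name": "master", "node_decision": "no",
     "deciders": [{"decider": "same_shard", "decision": "NO"},
                  {"decider": "throttling", "decision": "THROTTLE"}]},
]

for (decider, verdict), n in count_decider_verdicts(nodes).most_common():
    print(f"{decider}: {verdict} x{n}")
```

Here the throttling decider appears on both nodes, which matches the finding that the recovery concurrency limits are the bottleneck.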

Note:

Some requests used while analyzing the ES cluster (Kibana console syntax):

GET _cat/health
GET _cluster/health
GET _cat/nodes
GET _cluster/health?level=indices
GET _cluster/health?level=shards
GET _cluster/allocation/explain
GET _cat/indices
GET _cluster/state
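
Outside Kibana, the same requests can be issued from a script. The sketch below uses only the Python standard library and assumes a cluster reachable without authentication. The base URL `http://localhost:9200` and the function names are hypothetical.

```python
import urllib.request

# The request paths listed above, in the same order.
DIAGNOSTIC_PATHS = [
    "_cat/health",
    "_cluster/health",
    "_cat/nodes",
    "_cluster/health?level=indices",
    "_cluster/health?level=shards",
    "_cluster/allocation/explain",
    "_cat/indices",
    "_cluster/state",
]


def build_urls(base_url: str) -> list[str]:
    """Join the base URL with each diagnostic path."""
    root = base_url.rstrip("/")
    return [f"{root}/{path}" for path in DIAGNOSTIC_PATHS]


def run_diagnostics(base_url: str, timeout: float = 10.0) -> dict[str, str]:
    """GET every diagnostic URL and return the raw response text per URL."""
    results = {}
    for url in build_urls(base_url):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[url] = resp.read().decode("utf-8")
        except OSError as exc:  # covers URLError, HTTPError and timeouts
            results[url] = f"request failed: {exc}"
    return results


# Hypothetical base URL; this only prints the URLs, it sends no requests.
for url in build_urls("http://localhost:9200"):
    print(url)
```

Calling `run_diagnostics("http://localhost:9200")` would send the requests. Note that `_cluster/allocation/explain` without a body returns an error when the cluster has no unassigned shards, which this sketch records as a failed request instead of raising.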
