Troubleshooting an ES cluster in yellow status

Posted by an anonymous user, 2021-6-2 20:21

Background

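A full-text search endpoint in our project was taking more than 30 s to respond. Going through the endpoint logic showed that most of the time was spent in the ES query, so I turned to the ES cluster. Running the DSL generated by the request in Kibana confirmed that the response time really was too long, so I started checking the health of the cluster.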

Analyzing the cluster with ES commands gave the following results:

1. The cluster health status is yellow, and a large number of replica shards are unassigned;

{
  "cluster_name" : "cdb*",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : ***,
  "number_of_data_nodes" : ***,
  "active_primary_shards" : ***,
  "active_shards" : ***,
  "relocating_shards" : ***,
  "initializing_shards" : ***,
  "unassigned_shards" : 214, // note this value
  "delayed_unassigned_shards" : ***,
  "number_of_pending_tasks" : ***,
  "number_of_in_flight_fetch" : ***,
  "task_max_waiting_in_queue_millis" : ***
}
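
As a rough illustration of how to read this response, the sketch below turns the status and unassigned_shards fields into a one-line summary. The function name describe_health is made up for this example, and the sample dict contains only the two fields whose values are visible in the output above; everything else is elided in the post and is left out here.

```python
def describe_health(health: dict) -> str:
    """Summarize the 'status' and 'unassigned_shards' fields of a
    GET _cluster/health response (hypothetical helper, not an ES API)."""
    status = health["status"]
    unassigned = health["unassigned_shards"]
    if status == "green":
        return "green: all shards assigned"
    if status == "yellow":
        # Yellow means every primary is allocated, so whatever is
        # unassigned must be replica shards.
        return f"yellow: all primaries active, {unassigned} replica shard(s) unassigned"
    if status == "red":
        return f"red: at least one primary unassigned ({unassigned} unassigned shard(s) in total)"
    return f"unknown status {status!r}"


# Only the fields that are not masked in the post.
sample = {"status": "yellow", "unassigned_shards": 214}
print(describe_health(sample))
```

A yellow cluster can still serve every query because all primaries are active, so the status by itself only says that redundancy is reduced.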

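2. One node could not be reached because of where it is located, so the cluster triggered shard recovery (1. reallocating all lost replica shards to the other healthy nodes in the cluster; 2. a rebalancing operation);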

{
 "unassigned_info": {
  "reason": "NODE_LEFT",
  "at": "2020-11-20T03:12:16",
  "details": "node_left ***",
  "last_allocation_status": "no_attempt"
 }
}
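
Output like this comes from the allocation explain API, which can be pointed at one specific shard copy by sending a request body with index, shard and primary. The sketch below shows one way to do that from Python with only the standard library. The helper names explain_body and explain_shard and the base_url parameter are assumptions for illustration, authentication is not handled, and the index name and shard number are the ones that appear in the explain output quoted later in this post.

```python
import json
import urllib.request


def explain_body(index: str, shard: int, primary: bool) -> dict:
    """Build the request body for _cluster/allocation/explain."""
    return {"index": index, "shard": shard, "primary": primary}


def explain_shard(base_url: str, index: str, shard: int, primary: bool = False) -> dict:
    """POST the body to the cluster and return the parsed JSON response.

    base_url is an assumed address such as "http://localhost:9200";
    no authentication or TLS options are set in this sketch.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/allocation/explain",
        data=json.dumps(explain_body(index, shard, primary)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Ask about the replica (primary=False) of shard 2 of index_execution.
print(json.dumps(explain_body("index_execution", 2, False)))
```

Calling the API without a body asks ES to explain an unassigned shard of its own choosing, which is usually enough for a first look.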

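3. The shard recovery concurrency limits (on the source node and on the target node) were left at their default settings, so the recovery slots were fully used and recovery was too slow;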

(cluster.routing.allocation.node_concurrent_incoming_recoveries=2, cluster.routing.allocation.node_concurrent_outgoing_recoveries=2)
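
One possible follow-up, not something the post itself reports doing, is to raise both limits through the cluster settings API. The sketch below only builds the body that would be sent with PUT _cluster/settings. The function name is made up, the value 4 is purely illustrative, and the sketch relies on cluster.routing.allocation.node_concurrent_recoveries setting the incoming and outgoing limits together, as the ES explanation messages in this post indicate.

```python
import json

SETTING = "cluster.routing.allocation.node_concurrent_recoveries"


def recovery_settings_body(concurrent_recoveries: int, persistent: bool = False) -> dict:
    """Build a body for PUT _cluster/settings (hypothetical helper).

    concurrent_recoveries: new per-node limit; 4 below is only an example.
    persistent: whether to store the value as a persistent setting
                instead of a transient one.
    """
    if concurrent_recoveries < 1:
        raise ValueError("concurrent_recoveries must be at least 1")
    scope = "persistent" if persistent else "transient"
    return {scope: {SETTING: concurrent_recoveries}}


print(json.dumps(recovery_settings_body(4), indent=2))
```

A higher limit means more shards are copied at once, which adds disk and network load on nodes that are still serving queries, so it is worth raising the value in small steps and watching query latency.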

Problem description:
{
 "node_id": "***",
 "node_name": "mastersha",
 "transport_address": "***",
 "node_decision": "throttled",
 "deciders": [{
  "decider": "throttling",
  "decision": "THROTTLE",
  "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
 }]
}

{
 "node_id": "***",
 "node_name": "master",
 "transport_address": "***",
 "node_decision": "no",
 "store": {
  "matching_sync_id": true
 },
 "deciders": [{
   "decider": "same_shard",
   "decision": "NO",
   "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index_execution][2], node[***], [P], s[STARTED], a[id=***]]"
  },
  {
   "decider": "throttling",
   "decision": "THROTTLE",
   "explanation": "reached the limit of outgoing shard recoveries [2] on the node [***] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
  }
 ]
}
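
The two node entries above differ in an important way: a THROTTLE decision clears up once a recovery slot frees up, while a NO decision rules the node out for this shard copy. A minimal sketch of that reading is below. The classify_node helper and its labels are made up for this example, and the sample data keeps only the decider and decision fields from the output above.

```python
def classify_node(node: dict) -> str:
    """Classify one node entry from an allocation explain response
    (hypothetical helper; the labels are this sketch's own)."""
    decisions = {d["decision"] for d in node.get("deciders", [])}
    if "NO" in decisions:
        # A hard NO wins: the shard cannot go to this node even after
        # throttling ends.
        return "blocked"
    if "THROTTLE" in decisions:
        # Only throttled: allocation can proceed when a slot frees up.
        return "waiting"
    return "allowed"


nodes = [
    {"node_name": "mastersha",
     "deciders": [{"decider": "throttling", "decision": "THROTTLE"}]},
    {"node_name": "master",
     "deciders": [{"decider": "same_shard", "decision": "NO"},
                  {"decider": "throttling", "decision": "THROTTLE"}]},
]

for node in nodes:
    print(node["node_name"], classify_node(node))
```

Read this way, the replica can only go to mastersha, and it is waiting there for one of the two incoming recovery slots.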

Note:

Some commands used for the ES performance analysis:

GET _cat/health
GET _cluster/health
GET _cat/nodes
GET _cluster/health?level=indices
GET _cluster/health?level=shards
GET _cluster/allocation/explain
GET _cat/indices
GET _cluster/state
