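Background:
In our project, the full-text search endpoint was taking more than 30 s to respond. Tracing the endpoint logic showed that most of the time was spent in the ES query, so I turned to the ES cluster. Running the DSL generated by the request in Kibana confirmed that the response really was too slow, so I started checking the health of ES.
Analyzing the cluster with ES commands gave the following findings:
1. Cluster health is yellow, with a large number of unassigned replica shards;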
{
  "cluster_name" : "cdb*",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : ***,
  "number_of_data_nodes" : ***,
  "active_primary_shards" : ***,
  "active_shards" : ***,
  "relocating_shards" : ***,
  "initializing_shards" : ***,
  "unassigned_shards" : 214, // note this value
  "delayed_unassigned_shards" : ***,
  "number_of_pending_tasks" : ***,
  "number_of_in_flight_fetch" : ***,
  "task_max_waiting_in_queue_millis" : ***
}
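The check above can also be done programmatically. The following is a minimal sketch, assuming the `GET _cluster/health` response has already been parsed into a Python dict; the helper name `health_issues` is hypothetical, and the sample only carries the two values visible in the output above (the redacted fields are left out rather than guessed).

```python
def health_issues(health):
    """Return human-readable reasons why a _cluster/health response looks unhealthy."""
    issues = []
    status = health.get("status")
    if status != "green":
        issues.append(f"status is {status}")
    # Shard counters that should normally be 0 on a settled cluster.
    for key in ("unassigned_shards", "initializing_shards", "relocating_shards"):
        count = health.get(key, 0)
        if isinstance(count, int) and count > 0:
            issues.append(f"{key}={count}")
    return issues


# Only the fields shown unredacted in the post are included here.
sample = {"status": "yellow", "timed_out": False, "unassigned_shards": 214}
print(health_issues(sample))
```

An empty list means none of these indicators fired; it does not prove that the cluster is healthy in every respect.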
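2. One node became unreachable because of its location, which triggered shard recovery in the cluster (1. reallocating all lost replica shards to the other healthy nodes in the cluster; 2. rebalancing);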
{
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2020-11-20T03:12:16",
    "details": "node_left ***",
    "last_allocation_status": "no_attempt"
  }
}
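3. Shard recovery concurrency (outgoing on the source node and incoming on the target node) was left at the default settings, so the recovery slots were fully used and recovery was very slow;
(cluster.routing.allocation.node_concurrent_incoming_recoveries=2, cluster.routing.allocation.node_concurrent_outgoing_recoveries=2)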
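If the default of 2 is the bottleneck, one possible remedy is to raise the limit through the cluster settings API. The sketch below only builds the request body for `PUT _cluster/settings`; it does not contact a cluster. The helper name `build_recovery_settings` and the value 4 are assumptions for illustration and are not taken from the original investigation. The right value depends on the disk and network headroom of the nodes.

```python
import json


def build_recovery_settings(concurrency, scope="persistent"):
    """Build a body for PUT _cluster/settings that raises recovery concurrency.

    Uses the combined setting node_concurrent_recoveries, which sets both the
    incoming and the outgoing limit on each node.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    if scope not in ("persistent", "transient"):
        raise ValueError("scope must be 'persistent' or 'transient'")
    return {
        scope: {
            "cluster.routing.allocation.node_concurrent_recoveries": concurrency
        }
    }


# 4 is an illustrative value, not a recommendation from the original post.
body = build_recovery_settings(4)
print("PUT _cluster/settings")
print(json.dumps(body, indent=2))
```

The printed request can be pasted into the Kibana console. Raising the limit puts more load on the nodes during recovery, so it is worth lowering it again once the cluster is back to green.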
Problem description:
{
  "node_id": "***",
  "node_name": "mastersha",
  "transport_address": "***",
  "node_decision": "throttled",
  "deciders": [{
    "decider": "throttling",
    "decision": "THROTTLE",
    "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
  }]
}
{
  "node_id": "***",
  "node_name": "master",
  "transport_address": "***",
  "node_decision": "no",
  "store": {
    "matching_sync_id": true
  },
  "deciders": [{
      "decider": "same_shard",
      "decision": "NO",
      "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index_execution][2], node[***], [P], s[STARTED], a[id=***]]"
    },
    {
      "decider": "throttling",
      "decision": "THROTTLE",
      "explanation": "reached the limit of outgoing shard recoveries [2] on the node [***] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
    }
  ]
}
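When many nodes are involved, reading these per-node decisions by eye gets tedious. The following is a rough sketch of how they could be summarized, assuming the per-node entries (the `node_allocation_decisions` array of the `GET _cluster/allocation/explain` response) are already loaded as a list of dicts. The helper name `blocking_deciders` is hypothetical, and the sample keeps only the fields needed from the two entries above.

```python
def blocking_deciders(node_decisions):
    """Map each node name to the (decider, decision) pairs that did not answer YES."""
    summary = {}
    for node in node_decisions:
        blockers = [
            (d["decider"], d["decision"])
            for d in node.get("deciders", [])
            if d.get("decision") != "YES"
        ]
        summary[node["node_name"]] = blockers
    return summary


# Trimmed copies of the two node entries shown in the post.
sample = [
    {"node_name": "mastersha", "node_decision": "throttled",
     "deciders": [{"decider": "throttling", "decision": "THROTTLE"}]},
    {"node_name": "master", "node_decision": "no",
     "deciders": [{"decider": "same_shard", "decision": "NO"},
                  {"decider": "throttling", "decision": "THROTTLE"}]},
]
result = blocking_deciders(sample)
for name, blockers in result.items():
    print(name, blockers)
```

Here both nodes report a `throttling` decider, which matches the finding that the recovery concurrency limit of 2 was the bottleneck. The `same_shard` NO on `master` is expected, since a replica cannot be placed on the node that already holds the primary.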
Note:
Some of the commands (Kibana console requests) used for ES performance analysis:
GET _cat/health
GET _cluster/health
GET _cat/nodes
GET _cluster/health?level=indices
GET _cluster/health?level=shards
GET _cluster/allocation/explain
GET _cat/indices
GET _cluster/state