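Flink supports several metric reporters, such as Prometheus, InfluxDB, and JMX. This post documents how to monitor Flink jobs with Prometheus + Pushgateway.
00x Preparation
1. Download Prometheus, Pushgateway, and Alertmanager
All three tools are part of the Prometheus ecosystem. Download page: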
https://prometheus.io/download/
01x Configuration
1. Extract the downloaded components into any directory, for example:
/app/prometheus
/app/pushgateway
/app/alertmanager
2. Configure Prometheus (/app/prometheus/prometheus.yml)
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "second_rules.yml"
# Scrape configurations: one job for Prometheus itself and one for the Pushgateway.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['localhost:9091']
        labels:
          instance: pushgateway
All components are deployed on the same machine, on the following ports: prometheus -> 9090, pushgateway -> 9091, alertmanager -> 9093.
3. Start Prometheus and Pushgateway
nohup /app/prometheus/prometheus --config.file=/app/prometheus/prometheus.yml --web.enable-admin-api > /app/log/prometheus.log 2>&1 &
nohup /app/pushgateway/pushgateway --web.listen-address=":9091" > /app/log/pushgateway.log 2>&1 &
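Open the web UI to verify the configuration. Assuming the services are deployed on the machine 192.168.1.100:
Visit http://192.168.1.100:9090/targets to view the status of prometheus and pushgateway. If the deployment succeeded, both targets are shown as UP (see screenshot).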
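The same check can be scripted against the Prometheus HTTP API endpoint /api/v1/targets instead of the web page. The following is a minimal sketch, assuming the host and port used in this post; the helper names unhealthy_targets and fetch_targets are made up for illustration.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return (job, scrapeUrl) for every active target whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://192.168.1.100:9090") -> dict:
    """Fetch the target list from a running Prometheus (needs network access)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Calling unhealthy_targets(fetch_targets()) should return an empty list when both the prometheus and pushgateway jobs are being scraped successfully.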

4. Configure Flink
4.1 Copy flink-metrics-prometheus-1.x.x.jar into the flink/lib directory
4.2 Edit flink-conf.yaml
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: 192.168.1.100
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-metrics-ppg
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
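To make these settings more concrete: a Pushgateway client sends metrics in the text exposition format to /metrics/job/<jobName>, and a random job name suffix keeps separate Flink processes from overwriting each other's metric group. The sketch below only illustrates that idea with the standard library. It is not Flink's reporter implementation, the suffix format is invented, and demo_metric is a placeholder name.

```python
import uuid
from urllib.parse import quote
from urllib.request import Request, urlopen


def job_name_with_suffix(job_name: str, random_suffix: bool) -> str:
    """Mimic the effect of randomJobNameSuffix; the suffix format is an assumption."""
    return job_name + uuid.uuid4().hex if random_suffix else job_name


def build_push(host: str, port: int, job: str, metrics: dict) -> tuple:
    """Build the Pushgateway URL and a text-exposition body (job must not contain '/')."""
    url = f"http://{host}:{port}/metrics/job/{quote(job, safe='')}"
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return url, body.encode("utf-8")


def push(url: str, body: bytes) -> int:
    """PUT replaces all metrics of the group; needs a reachable Pushgateway."""
    req = Request(url, data=body, method="PUT")
    with urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, push(*build_push("192.168.1.100", 9091, job_name_with_suffix("flink-metrics-ppg", True), {"demo_metric": 1.0})) would create a new group on the Pushgateway. With deleteOnShutdown set to false, groups pushed by Flink stay on the Pushgateway after the job stops.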
4.3 Start any Flink job
4.4 Visit http://192.168.1.100:9091 to see the metrics pushed by Flink:

4.5 Visit http://192.168.1.100:9090/graph to query the metric data
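The graph page runs instant queries that are also available through the HTTP API at /api/v1/query. As a rough sketch, assuming the host above and that the series carry Flink's job_name label, a query can be issued and its vector result read like this (query_url, parse_vector, and run_query are illustrative names):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: dict) -> dict:
    """Map the job_name label to the sample value of an instant-vector result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return {s["metric"].get("job_name", ""): float(s["value"][1]) for s in result}


def run_query(base_url: str, promql: str) -> dict:
    """Execute the query against a running Prometheus (needs network access)."""
    with urlopen(query_url(base_url, promql), timeout=5) as resp:
        return parse_vector(json.load(resp))
```

A call such as run_query("http://192.168.1.100:9090", "sum by (job_name) (flink_taskmanager_job_task_numRecordsInPerSecond)") would return one value per running job.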

5. Configure Alertmanager
Edit /app/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.xxx.com:25'
  smtp_from: 'admin@xxx.com'
  smtp_auth_username: 'xxx'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 5m
  group_interval: 30m
  repeat_interval: 1h
  receiver: 'default-mail'
receivers:
  - name: 'default-mail'
    email_configs:
      - to: 'abc@xxx.com'
The configuration can be checked with: /app/alertmanager/amtool check-config alertmanager.yml
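6. Configure rules
For example, the rule file /app/prometheus/rules/test.yml
groups:
  - name: No records
    rules:
      - alert: NoRecordsInOrOut
        expr: sum by (job_name) (irate(flink_taskmanager_job_task_numRecordsInPerSecond[1h]) + irate(flink_taskmanager_job_task_numRecordsOutPerSecond[1h])) == 0
        for: 5m # how long the condition must hold before the alert is sent to Alertmanager
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} has no incoming records"
          description: "{{ $labels.job_name }} has had no incoming records for 1 hour."
The expr groups by job_name and fires when the sum of flink_taskmanager_job_task_numRecordsInPerSecond + flink_taskmanager_job_task_numRecordsOutPerSecond over the 1-hour window equals 0. Note that irate is intended for counters and only uses the last two samples in the window, while these PerSecond metrics are already rates, so treat the expression as a starting point and adapt it to your metrics.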
Details on expressions: https://prometheus.io/docs/prometheus/latest/querying/basics/
Validate the rule (run from /app/prometheus): ./promtool check rules rules/test.yml
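The aggregation in the rule can be mirrored in plain Python to make its semantics easier to see: samples are grouped by job_name, summed, and a job is flagged when its total is exactly zero. This is only an illustration on made-up sample data. It does not model the range window or the `for: 5m` hold time, and silent_jobs is a hypothetical helper.

```python
from collections import defaultdict


def silent_jobs(samples: list) -> list:
    """samples: (job_name, records_in_rate, records_out_rate), one tuple per task.

    Returns the job names whose summed in+out rate is exactly zero,
    analogous to `sum by (job_name) (...) == 0`.
    """
    totals = defaultdict(float)
    for job_name, rate_in, rate_out in samples:
        totals[job_name] += rate_in + rate_out
    return sorted(job for job, total in totals.items() if total == 0)
```

A job with several tasks is flagged only when every task reports zero. One active task is enough to keep the sum above zero and the alert silent.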
7. Edit prometheus.yml so that it points to Alertmanager and loads the rule files
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
8. Start Alertmanager
nohup /app/alertmanager/alertmanager --config.file=/app/alertmanager/alertmanager.yml > /app/log/alertmanager.log 2>&1 &
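Before waiting for a real Flink alert, the email route can be exercised by posting a hand-made alert to Alertmanager's /api/v2/alerts endpoint. The sketch below assumes the host used in this post; the alert name ManualTestAlert and the helper names are invented for the test.

```python
import json
from urllib.request import Request, urlopen


def build_test_alert(job_name: str) -> bytes:
    """Build the JSON body for one synthetic alert (labels are illustrative)."""
    alerts = [{
        "labels": {"alertname": "ManualTestAlert", "severity": "warning", "job_name": job_name},
        "annotations": {"summary": f"Test alert for job {job_name}"},
    }]
    return json.dumps(alerts).encode("utf-8")


def send_alert(body: bytes, base_url: str = "http://192.168.1.100:9093") -> int:
    """POST the alert to a running Alertmanager (needs network access)."""
    req = Request(
        f"{base_url}/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return resp.status
```

With group_wait set to 5m as in the configuration above, the first email for send_alert(build_test_alert("demo")) should arrive roughly five minutes after the call.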
9. Restart Prometheus
02x Done
Once the rule condition is met, the alert fires and an email is sent.

References:
https://prometheus.io/docs/introduction/overview/
https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/metrics.html
https://www.jianshu.com/p/5e91a1ac2959