[Practice] Dynamic alerting rules based on Prometheus functions

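In monitoring, alerting rules are the foundation on which alerts are detected and fired. Almost every open-source alerting tool on the market today offers rules that are essentially static thresholds or range checks.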

Let's start with the official documentation.

The official comment on the function

Holt-Winters is similar to a weighted moving average, where historical data has exponentially less influence on the current data.
Holt-Winters also accounts for trends in data. The smoothing factor (0 < sf < 1) affects how historical data will affect the current data. A lower smoothing factor increases the influence of historical data. The trend factor (0 < tf < 1) affects how trends in historical data will affect the current data. A higher trend factor increases the influence of trends.
Algorithm taken from https://en.wikipedia.org/wiki/Exponential_smoothing titled: "Double exponential smoothing".

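In other words:
Holt-Winters resembles a weighted moving average in which the influence of historical data on the current value decays exponentially. It also accounts for the trend in the data. The smoothing factor (0 < sf < 1) controls how strongly historical data affects the current value: the lower it is, the more weight history carries. The trend factor (0 < tf < 1) controls how strongly the historical trend affects the current value: the higher it is, the more weight the trend carries. The algorithm is the one described under "Double exponential smoothing" at https://en.wikipedia.org/wiki/Exponential_smoothing.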

Official function code

func funcHoltWinters(vals []parser.Value, args parser.Expressions, enh *EvalNodeHelper) Vector {
	samples := vals[0].(Matrix)[0]

	// The smoothing factor argument.
	sf := vals[1].(Vector)[0].V

	// The trend factor argument.
	tf := vals[2].(Vector)[0].V

	// Sanity check the input.
	if sf <= 0 || sf >= 1 {
		panic(errors.Errorf("invalid smoothing factor. Expected: 0 < sf < 1, got: %f", sf))
	}
	if tf <= 0 || tf >= 1 {
		panic(errors.Errorf("invalid trend factor. Expected: 0 < tf < 1, got: %f", tf))
	}

	l := len(samples.Points)

	// Can't do the smoothing operation with less than two points.
	if l < 2 {
		return enh.Out
	}

	var s0, s1, b float64
	// Set initial values.
	s1 = samples.Points[0].V
	b = samples.Points[1].V - samples.Points[0].V

	// Run the smoothing operation.
	var x, y float64
	for i := 1; i < l; i++ {

		// Scale the raw value against the smoothing factor.
		x = sf * samples.Points[i].V

		// Scale the last smoothed value with the trend at this point.
		b = calcTrendValue(i-1, tf, s0, s1, b)
		y = (1 - sf) * (s1 + b)

		s0, s1 = s1, x+y
	}

	return append(enh.Out, Sample{
		Point: Point{V: s1},
	})
}

The comment on calcTrendValue

Calculate the trend value at the given index i in raw data d.
This is somewhat analogous to the slope of the trend at the given index.
The argument "tf" is the trend factor.
The argument "s0" is the computed smoothed value.
The argument "s1" is the computed trend factor.
The argument "b" is the raw input value.

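Notes:

  • The function calculates the trend value at index i of the raw data d.
  • This is roughly the slope of the trend at that index.
  • "tf" is the trend factor.
  • Judging by how funcHoltWinters calls it, "s0" and "s1" are the two most recent smoothed values, with s1 the newer one.
  • Likewise, "b" is the previous trend value, which is seeded with the difference between the first two samples.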

calcTrendValue code

func calcTrendValue(i int, tf, s0, s1, b float64) float64 {
	if i == 0 {
		return b
	}
	x := tf * (s1 - s0)
	y := (1 - tf) * b
	return x + y
}
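
Read together, the two functions appear to compute the recurrences below. This is a sketch derived from the code shown above; the symbols x_i (raw samples), s_i (smoothed values) and b_i (trend values) are notation introduced here and do not appear in the source.

```latex
\begin{aligned}
s_0 &= x_0, \qquad b_0 = x_1 - x_0,\\
b_{i-1} &= \mathit{tf}\,(s_{i-1} - s_{i-2}) + (1 - \mathit{tf})\,b_{i-2}, \qquad i \ge 2,\\
s_i &= \mathit{sf}\,x_i + (1 - \mathit{sf})\,(s_{i-1} + b_{i-1}), \qquad i \ge 1.
\end{aligned}
```

The value returned for a window of l samples is the last smoothed value, s_{l-1}.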

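Thoughts after reading the code

For me, the code mainly conveys the idea of the algorithm: given the samples currently in the window, it uses the smoothing factor and the weight assigned to historical data to compute a trend-adjusted value.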
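
To get a feel for the numbers, the same loop can be run outside Prometheus. The following is a minimal sketch, assuming a plain []float64 in place of the engine's Matrix type; the name doubleExpSmooth is hypothetical and is not part of Prometheus.

```go
package main

import (
	"errors"
	"fmt"
)

// doubleExpSmooth is a hypothetical stand-alone version of the loop in
// funcHoltWinters. It works on a plain slice instead of Prometheus types.
func doubleExpSmooth(xs []float64, sf, tf float64) (float64, error) {
	if sf <= 0 || sf >= 1 {
		return 0, errors.New("smoothing factor must satisfy 0 < sf < 1")
	}
	if tf <= 0 || tf >= 1 {
		return 0, errors.New("trend factor must satisfy 0 < tf < 1")
	}
	if len(xs) < 2 {
		return 0, errors.New("need at least two samples")
	}

	var s0 float64
	s1 := xs[0]        // initial smoothed value
	b := xs[1] - xs[0] // initial trend

	for i := 1; i < len(xs); i++ {
		// From the second iteration on, update the trend from the last
		// two smoothed values (the role of calcTrendValue).
		if i > 1 {
			b = tf*(s1-s0) + (1-tf)*b
		}
		// Blend the raw sample with the trend-adjusted previous value.
		s0, s1 = s1, sf*xs[i]+(1-sf)*(s1+b)
	}
	return s1, nil
}

func main() {
	// A perfectly linear series is followed exactly.
	linear, _ := doubleExpSmooth([]float64{10, 20, 30, 40, 50}, 0.5, 0.8)
	fmt.Printf("linear: %.2f\n", linear) // prints "linear: 50.00"

	// A spike in the last sample is only partly reflected (sf = 0.5).
	spike, _ := doubleExpSmooth([]float64{10, 10, 10, 100}, 0.5, 0.8)
	fmt.Printf("spike:  %.2f\n", spike) // prints "spike:  55.00"
}
```

The second case hints at why the smoothed value is useful as a baseline: a sudden jump moves it only part of the way, so the gap between the raw value and the smoothed value stays visible.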

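Practice

Detecting sudden spikes and drops in network traffic

We alert on abnormal spikes and drops in a machine's inbound traffic. In production, traffic is a periodically varying value: as long as request volume is stable, traffic rises and falls along a fairly regular pattern within a given time window, so a large proportional drop or rise indicates a problem.
1. Prometheus query for the traffic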

sum(irate(node_network_receive_bytes_total{app="01",idc="bj",group="test",device=~"eth1"}[5m] offset 1m)*8)by(device)

2. Prometheus recording rule with offset 5m

groups:
- name: instance_net_record_rules
  rules:
  - record: net_offset_5m:node_network_receive:bj_test_01
    expr: sum(irate(node_network_receive_bytes_total{app="01",idc="bj",group="test",device=~"eth1"}[1m] offset 5m)*8)by(device)

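3. The actual alerting rule
Meaning: for machines with app="01", idc="bj", group="test", compare the inbound traffic on the eth1 interface with the traffic from five minutes earlier, and alert when it fluctuates by more than 30%.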

abs(sum(irate(node_network_receive_bytes_total{app="01",idc="bj",group="test",device=~"eth1"}[1m])*8)by(device) - holt_winters(net_offset_5m:node_network_receive:bj_test_01[1h],0.5,0.8))/1024^2 > holt_winters(net_offset_5m:node_network_receive:bj_test_01[1h],0.5,0.8)*0.3/1024^2 >5

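Breakdown:

  • abs takes the absolute value, so that spikes and drops are treated alike.
  • holt_winters(net_offset_5m:node_network_receive:bj_test_01[1h],0.5,0.8)/1024^2 smooths the data recorded with the 5m offset to estimate roughly what the current value should be.
  • 1024^2 converts the traffic unit from bit/s to Mbit/s.
  • > 5 suppresses alerts during the traffic trough, when absolute volumes are low.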
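
The condition can also be written out as ordinary code to check the logic by hand. The sketch below is only an illustration of the expression above: shouldAlert and the constant names are hypothetical, and it assumes that the chained comparison `a > b > 5` in PromQL keeps the left-hand value of `a > b` and then filters it against 5.

```go
package main

import (
	"fmt"
	"math"
)

const (
	bitsPerMbit      = 1024 * 1024 // the same 1024^2 divisor used in the rule
	relativeLimit    = 0.3         // 30% of the predicted value
	minDeviationMbit = 5.0         // floor that mutes alerts at low traffic
)

// shouldAlert is a hypothetical helper mirroring the alert expression.
// Both arguments are in bit/s: the current rate and the smoothed prediction.
func shouldAlert(currentBits, predictedBits float64) bool {
	deviation := math.Abs(currentBits-predictedBits) / bitsPerMbit
	threshold := predictedBits * relativeLimit / bitsPerMbit
	return deviation > threshold && deviation > minDeviationMbit
}

func main() {
	// 100 Mbit/s predicted, 140 Mbit/s observed: 40 > 30 and 40 > 5.
	fmt.Println(shouldAlert(140*bitsPerMbit, 100*bitsPerMbit)) // prints "true"

	// 100 Mbit/s predicted, 120 Mbit/s observed: 20 is below the 30% band.
	fmt.Println(shouldAlert(120*bitsPerMbit, 100*bitsPerMbit)) // prints "false"

	// 10 Mbit/s predicted, 14 Mbit/s observed: over 30%, but under the floor.
	fmt.Println(shouldAlert(14*bitsPerMbit, 10*bitsPerMbit)) // prints "false"
}
```

The third case shows the purpose of the final `> 5`: at low traffic a 40% swing is only a few Mbit/s, so it is ignored.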
4. Comparison

holt_winters(net_offset_5m:node_network_receive:bj_test_01[1h],0.5,0.8) gives approximately the same value as
sum(irate(node_network_receive_bytes_total{app="01",idc="bj",group="test",device=~"eth1"}[1m])*8)by(device)

Result chart: