Alerting with Nagios

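OpenTSDB is great, but it is not yet a complete monitoring platform. Once you have plenty of metrics in OpenTSDB, you will want to send alerts when a metric crosses a threshold. That is easy to set up.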

The tools directory contains a Python script, check_tsd. The script queries OpenTSDB and returns Nagios-compatible output that gives you an OK/WARNING/CRITICAL status.
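For readers new to Nagios plugins, the following is a minimal sketch of what "Nagios-compatible output" means: a one-line status message on standard output plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The helper names format_status and report are hypothetical and are not taken from check_tsd; the exact message text that check_tsd prints may differ.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_status(code, message):
    """Build the one-line status text that Nagios displays for a check.

    Hypothetical helper for illustration; unknown codes are labeled UNKNOWN.
    """
    return "%s: %s" % (STATUS_NAMES.get(code, "UNKNOWN"), message)


def report(code, message):
    """Print the status line and return the code to pass to sys.exit()."""
    print(format_status(code, message))
    return code
```

A plugin would typically end with something like sys.exit(report(WARNING, "some message")), so that Nagios can read both the text and the exit code.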

Parameters

Options:
  -h, --help            show this help message and exit
  -H HOST, --host=HOST  Hostname to use to connect to the TSD.
  -p PORT, --port=PORT  Port to connect to the TSD instance on.
  -m METRIC, --metric=METRIC
                        Metric to query.
  -t TAG, --tag=TAG     Tags to filter the metric on.
  -d SECONDS, --duration=SECONDS
                        How far back to look for data. Default 600s.
  -D METHOD, --downsample=METHOD
                        Downsample function, e.g. one of avg, min, sum, max ... etc
  -W SECONDS, --downsample-window=SECONDS
                        Window size over which to downsample.
  -a METHOD, --aggregator=METHOD
                        Aggregation method: avg, min, sum (default), max .. etc
  -x METHOD, --method=METHOD
                        Comparison method: gt, ge, lt, le, eq, ne.
  -r, --rate            Use rate value as comparison operand.
  -w THRESHOLD, --warning=THRESHOLD
                        Threshold for warning.  Uses the comparison method.
  -c THRESHOLD, --critical=THRESHOLD
                        Threshold for critical.  Uses the comparison method.
  -v, --verbose         Be more verbose.
  -T SECONDS, --timeout=SECONDS
                        How long to wait for the response from TSD.
  -E, --no-result-ok    Return OK when TSD query returns no result.
  -I SECONDS, --ignore-recent=SECONDS
                        Ignore data points that are that recent.
  -P PERCENT, --percent-over=PERCENT
                        Only alarm if PERCENT of the data points violate the
                        threshold.
  -N UTC, --now=UTC     Set unix timestamp for "now", for testing
  -S, --ssl             Make queries to OpenTSDB via SSL (https)

For the complete list of downsampling and aggregation functions, see http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html#available-aggregators
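To make the interaction between --method, --warning, --critical and --percent-over more concrete, here is a rough sketch of how a list of data point values could be turned into a Nagios status. This is an illustration under stated assumptions, not the code of check_tsd: the function name evaluate is hypothetical, and the treatment of the --percent-over boundary (strictly more than PERCENT of the points must violate the threshold) is an assumption.

```python
import operator

# Comparison methods named by the -x/--method option.
COMPARATORS = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}

OK, WARNING, CRITICAL = 0, 1, 2


def evaluate(values, method, warning=None, critical=None, percent_over=0):
    """Return a Nagios status code for a list of data point values.

    Assumption: a threshold counts as violated when strictly more than
    `percent_over` percent of the values satisfy `value <method> threshold`.
    """
    compare = COMPARATORS[method]

    def violated(threshold):
        # No threshold configured, or no data: nothing can be violated.
        if threshold is None or not values:
            return False
        bad = sum(1 for v in values if compare(v, threshold))
        return bad * 100.0 / len(values) > percent_over

    # Check the more severe threshold first.
    if violated(critical):
        return CRITICAL
    if violated(warning):
        return WARNING
    return OK
```

For example, evaluate([0.5, 1.5], "gt", warning=1, critical=2) yields WARNING, because one value exceeds 1 and none exceeds 2. With method "lt" the same thresholds alert on values that are too low, which suits checks such as "fewer than one request received".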

Nagios Setup

Drop the script into your Nagios path and define a command like this:

define command{
        command_name check_tsd
        command_line $USER1$/check_tsd -H $HOSTADDRESS$ $ARG1$
}

Then define a host in Nagios for your TSD server. Give it a check_command that is guaranteed to return something whenever the backend is healthy.

define host{
        host_name               tsd
        address                 tsd
        check_command           check_tsd!-d 60 -m rate:tsd.rpc.received -t type=put -x lt -c 1
        [...]
}

Then define service checks for the things you want to monitor.

define service{
        host_name                       tsd
        service_description             Apache too many internal errors
        check_command                   check_tsd!-d 300 -m rate:apache.stats.hits -t status=500 -w 1 -c 2
        [...]
}

Testing

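To test your parameters against a specific point in time, use the --now <UTC> parameter to specify an explicit unix timestamp, which is used as the current timestamp instead of the actual current time. When it is set, the script fetches data starting at UTC - duration and ending at UTC.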
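The window arithmetic described above can be sketched as follows. The function query_window is hypothetical and only mirrors the description; in particular, treating --ignore-recent as a cutoff at now - SECONDS is an assumption about how that option behaves.

```python
import time


def query_window(duration, now=None, ignore_recent=0):
    """Return (start, end, cutoff) as unix timestamps in seconds.

    duration      -- how far back to look (the --duration value)
    now           -- explicit "now" (the --now value); defaults to the clock
    ignore_recent -- assumed meaning of --ignore-recent: data points newer
                     than `cutoff` are skipped
    """
    if now is None:
        now = int(time.time())
    start = now - duration        # window begins at UTC - duration
    cutoff = now - ignore_recent  # assumed: points after this are ignored
    return start, now, cutoff
```

With a fixed now, the window is reproducible: query_window(600, now=1700000000) covers 1699999400 through 1700000000, which makes it possible to replay a past incident against new thresholds.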

To see the values that were retrieved, and which of them were ignored (because of the duration), use the --verbose option.