Implementing SLOs using Prometheus and Grafana
Online services should aim to provide a service availability that matches business requirements. A key part of this process should involve different teams in an organization, for example, from the business development team to the engineering team.
To verify that a service complies with these targets, you need measurable thresholds, for example, "the service must be available 99.9% of the time", which should in turn match users' expectations and business continuity needs.
SLAs, SLOs, SLIs word soup
There's already a lot written about these topics.
If you are not familiar with these terms, I would strongly recommend reading the article from Google's SRE book on Service Level Objectives first.
In summary:
SLA: Service Level Agreement
- What service you commit to provide to users, with possible penalties if you are not able to meet it.
- Example: "99.5%" availability.
- Keyword: contract
SLO: Service Level Objective
- What you have internally set as a target, driving your measurement thresholds (for example, for dashboards and alerting). In general, it should be stricter than your SLA.
- Example: "99.9%" availability (the so-called "three 9s").
- Keyword: thresholds
SLI: Service Level Indicator
- What you actually measure, to ascertain whether your SLOs are on/off-target.
- Example: error ratios, latency
- Keyword: metrics
SLOs are about time
So what does 99% availability mean? It's not a 1% error ratio (percentage of failed HTTP responses), but rather the percentage of time, over a predefined period, during which the service has been available.

In the dashboard above, the service went above a 0.1% error ratio (0.001 on the y-axis) for 1 hour (the small red horizontal segment on top of the error spike), giving 99.4% availability over a 7-day period.
A key factor in this result is the time span you choose for measuring availability (7 days in the above example). Shorter periods are typically used as checkpoints for the engineering teams involved (for example, SRE and SWE) to track how the service is doing, while longer periods are usually used for review purposes by the organization or wider team.
For example, if you set a 99.9% SLO, the total time the service can be down is:
- over 30 days: ~43 min (about three quarters of an hour)
- over 90 days: ~129 min (about 2 hours)
Another trivial "numbers fact" is that adding extra 9s to the SLO has an exponential impact: each additional 9 divides the allowed downtime by ten. See the following time fractions for a total period of 1 year:
- 2×9s (99%): ~5,250 min (~87 hours, or ~3.6 days)
- 3×9s (99.9%): ~525 min (~8.7 hours)
- 4×9s (99.99%): ~52.5 min
- 5×9s (99.999%): ~5 min (handy rule of approximation: five 9s -> about 5 minutes per year)
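As a sanity check on the numbers above, here is a tiny stand-alone jsonnet helper (a sketch only, not part of the repository discussed later) that computes the downtime budget allowed by an SLO target over a given period:

```jsonnet
// Sketch only: minutes of downtime allowed by an SLO target over `days` days.
local downtimeMinutes(slo, days) = (1 - slo) * days * 24 * 60;

{
  days30_999: downtimeMinutes(0.999, 30),  // 99.9% over 30 days -> ~43.2 min
  days90_999: downtimeMinutes(0.999, 90),  // 99.9% over 90 days -> ~129.6 min
  year_99: downtimeMinutes(0.99, 365),     // 99% over 1 year -> 5256 min (~3.65 days)
}
```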
Enter error budgets
The above numbers for the allowed time a service can be down may be thought of as an error budget, which is consumed by events such as the following:
- planned maintenance
- failed upgrades
- unexpected outages
The practical outcome is that any of the above consumes error budget from your service; for example, an unexpected outage may deplete it to the point of blocking further maintenance work during that time period.
SLIs are about metrics
From the above, it's clear that we must have service metrics to tell us when the service is considered (un)available. There are several approaches for this:
- RED: Rate, Errors, Duration - introduced by @tom_wilkie
- USE: Utilization, Saturation, and Errors - introduced by @brendangregg
Example SLO implementation
Let's take a specific example, following the RED method (as the metrics we already have available are a better match for this approach): creating alerts and dashboards to support a target SLO for the Kubernetes API, using tools commonly chosen for monitoring: [Prometheus] and [Grafana].
Additionally, we'll use [jsonnet] to build our rules and dashboard files, taking advantage of existing library helpers.
Rather than explaining how to signal when your service is outside its thresholds, this article focuses on how to record the time the service has been in that condition, as discussed in the "SLOs are about time" section above.
The rest of the article will therefore focus on creating Prometheus rules to capture "time out of SLO", based on thresholds for specific metrics (SLIs).
Define the SLO target and metrics thresholds
Let's define a simple target:
- SLO: 99%, built from the following SLIs:
  - error ratio under 1%
  - latency under 200 ms for the 90th percentile of requests

Writing the above spec as jsonnet (see [spec-kubeapi.jsonnet]):
```jsonnet
slo:: {
  target: 0.99,
  error_ratio_threshold: 0.01,
  latency_percentile: 90,
  latency_threshold: 200,
},
```
Finding the SLIs
The Kubernetes API exposes several metrics we can use as SLIs, using the Prometheus `rate()` function over a short period (here we choose 5 min; this number should be a few times your scraping interval):

- `apiserver_request_count`: counts all the requests by `verb`, `code`, `resource`. For example, to get the total error ratio for the last 5 min:

  ```
  sum(rate(apiserver_request_count{code=~"5.."}[5m]))
    /
  sum(rate(apiserver_request_count[5m]))
  ```

  The formula above discards all metric labels (for example, the http `verb` or `code`). If you want to keep some labels, you'd need to do something similar to the following:

  ```
  sum by (verb, code) (rate(apiserver_request_count{code=~"5.."}[5m]))
    / ignoring (verb, code) group_left
  sum (rate(apiserver_request_count[5m]))
  ```

- `apiserver_request_latencies_bucket`: latency histogram by `verb`. For example, to get the 90th latency quantile in milliseconds (note that the `le` "less or equal" label is special, as it sets the histogram bucket intervals; see [Prometheus histograms and summaries][promql-histogram]):

  ```
  histogram_quantile (
    0.90,
    sum by (le, verb, instance)(
      rate(apiserver_request_latencies_bucket[5m])
    )
  ) / 1e3
  ```
Writing Prometheus rules to record the chosen SLIs
PromQL is a very powerful language, although as of October 2018 it doesn't yet support nested subqueries over ranges (see Prometheus issue 1227 for details), a feature we need to compute the ratio of time during which the error ratio or latency is outside its threshold.

Also, as a good practice to lower query-time Prometheus resource usage, it is recommended to always add recording rules to precompute expressions such as `sum(rate(...))` anyway.
As an example of how to do this, the following set of recording rules is built from our [bitnami-labs/kubernetes-grafana-dashboards] repository to capture the above time ratio:

- Create a new `kubernetes:job_verb_code_instance:apiserver_requests:rate5m` metric to record request rates:

  ```yaml
  record: kubernetes:job_verb_code_instance:apiserver_requests:rate5m
  expr: |
    sum by(job, verb, code, instance) (rate(apiserver_request_count[5m]))
  ```

- Using the above metric, create a new `kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m` metric for the request ratios (over the total):

  ```yaml
  record: kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
  expr: |
    kubernetes:job_verb_code_instance:apiserver_requests:rate5m
      / ignoring(verb, code) group_left()
    sum by(job, instance) (
      kubernetes:job_verb_code_instance:apiserver_requests:rate5m
    )
  ```

- Using the above ratio metric (for every http `code` and `verb`), create a new one to capture the error ratios:

  ```yaml
  record: kubernetes:job:apiserver_request_errors:ratio_rate5m
  expr: |
    sum by(job) (
      kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
        {code=~"5..",verb=~"GET|POST|DELETE|PATCH"}
    )
  ```

- Using the above error ratio (and a similarly created `kubernetes::job:apiserver_latency:pctl90rate5m` rule for the recorded 90th percentile latency over the past 5 minutes, not shown above for simplicity), finally create a boolean metric to record our SLO compliance:

  ```yaml
  record: kubernetes::job:slo_kube_api_ok
  expr: |
    (kubernetes:job:apiserver_request_errors:ratio_rate5m < bool 0.01)
      *
    (kubernetes::job:apiserver_latency:pctl90rate5m < bool 200)
  ```
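Since `kubernetes::job:slo_kube_api_ok` is 1 while the service is within its thresholds and 0 otherwise, averaging it over time directly gives the availability ratio for that window. As a sketch (this extra rule is not part of the repository, and a 30d range assumes your Prometheus retention covers it), such a rule could look like the following, expressed as a jsonnet object in the same record/expr shape used above:

```jsonnet
// Sketch only (not in the repository): turn the boolean SLO metric into an
// availability ratio over a 30-day window; avg_over_time() of a 0/1 series
// is exactly the fraction of time spent within SLO.
{
  record: 'kubernetes::job:slo_kube_api_ok:avg_over_time30d',
  expr: 'avg_over_time(kubernetes::job:slo_kube_api_ok[30d])',
}
```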
Writing Prometheus alerting rules
The final `kubernetes::job:slo_kube_api_ok` metric above is very useful for dashboards and for accounting for SLO compliance, but we should alert on whichever of the above metrics is driving the SLO off target, as shown in the following Prometheus alert rules:

Alert on a high API error ratio:

```yaml
alert: KubeAPIErrorRatioHigh
expr: |
  sum by(instance) (
    kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
      {code=~"5..",verb=~"GET|POST|DELETE|PATCH"}
  ) > 0.01
for: 5m
```

Alert on high API latency:

```yaml
alert: KubeAPILatencyHigh
expr: |
  max by(instance) (
    kubernetes:job_verb_instance:apiserver_latency:pctl90rate5m
      {verb=~"GET|POST|DELETE|PATCH"}
  ) > 200
for: 5m
```
Note that the above Prometheus rules are taken from the already manifested jsonnet output, which can be found in [our sources][bitnami-labs/kubernetes-grafana-dashboards]; the thresholds are evaluated from `$.slo.error_ratio_threshold` and `$.slo.latency_threshold` respectively.
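To illustrate the idea (the import path and object layout below are assumptions for this sketch, not the repository's actual code), a threshold from the spec could be interpolated into an alert expression like this, so the number lives in exactly one place:

```jsonnet
// Sketch only: build an alerting rule whose threshold comes from the SLO spec.
// The import path and field layout are assumptions, not the repository's code.
local spec = import 'spec-kubeapi.jsonnet';

{
  alert: 'KubeAPIErrorRatioHigh',
  expr: 'kubernetes:job:apiserver_request_errors:ratio_rate5m > %g'
        % spec.slo.error_ratio_threshold,
  'for': '5m',  // quoted because "for" is a jsonnet keyword
}
```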
Programmatically creating Grafana dashboards
Creating Grafana dashboards is usually done by interacting with the UI. This is fine for simple and/or "standard" dashboards (for example, those downloaded from https://grafana.com/dashboards), but it becomes cumbersome if you want to implement best devops practices, especially for gitops workflows.
The community is addressing this issue with efforts such as Grafana libraries for jsonnet, python, and Javascript. Given our jsonnet implementation, we chose grafonnet-lib.
One very useful outcome of using jsonnet to set our SLO thresholds and code our Prometheus rules is that we can re-use these to build our Grafana dashboards, without having to copy and paste them; that is, we keep a single source of truth for both.
For example:

- referring to `$.slo.error_ratio_threshold` in our Grafana dashboards to set the Grafana graph panel's `thresholds` property, as we did above for our Prometheus alert rules
- referring to the created Prometheus recording rules via jsonnet, as in the following excerpt from [spec-kubeapi.jsonnet]; note the use of `metric.rules.requests_ratiorate_job_verb_code.record` (instead of the verbatim string 'kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m'):

  ```jsonnet
  // Graph showing all requests ratios
  req_ratio: $.grafana.common {
    title: 'API requests ratios',
    formula: metric.rules.requests_ratiorate_job_verb_code.record,
    legend: '{{ verb }} - {{ code }}',
  },
  ```
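For comparison, here is a minimal stand-alone grafonnet-lib sketch (it does not use our opinionated bitnami_grafana.libsonnet, and the panel layout is an assumption) that graphs the recorded error-ratio rule:

```jsonnet
// Sketch only: a bare grafonnet-lib dashboard (not the repository's helpers)
// graphing the recorded error-ratio rule.
local grafana = import 'grafonnet/grafana.libsonnet';

grafana.dashboard.new('Kubernetes API SLO (sketch)')
.addPanel(
  grafana.graphPanel.new('API error ratio')
  .addTarget(
    grafana.prometheus.target(
      'kubernetes:job:apiserver_request_errors:ratio_rate5m',
      legendFormat='error ratio'
    )
  ),
  // panel position/size chosen arbitrarily for this sketch
  { x: 0, y: 0, w: 24, h: 8 }
)
```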
You can read our full implementation in dash-kubeapi.jsonnet; the following is a screenshot of the resulting dashboard:

Putting it all together
We implemented the above ideas in our bitnami-labs/kubernetes-grafana-dashboards repository, under the jsonnet folder.
Our built Prometheus rules and Grafana dashboard files are produced from the following jsonnet sources (a minimal sketch of the mechanism follows the list):

- [spec-kubeapi.jsonnet]: as much data-only specification as possible (thresholds, rules and dashboard formulas)
- rules-kubeapi.jsonnet: outputs Prometheus recording rules and alerts
- dash-kubeapi.jsonnet: outputs Grafana dashboards, using grafonnet-lib via our opinionated bitnami_grafana.libsonnet
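As a minimal illustration of that mechanism (not the actual layout of rules-kubeapi.jsonnet), a jsonnet file like the following emits a Prometheus rule group as YAML when rendered with, for example, `jsonnet -S -m <outdir>`:

```jsonnet
// Sketch only: emit a Prometheus rule group as a YAML file from jsonnet.
// The file name, group name and the single rule chosen here are illustrative.
{
  'rules-kubeapi.yaml': std.manifestYamlDoc({
    groups: [{
      name: 'slo-kube-api',
      rules: [{
        record: 'kubernetes:job_verb_code_instance:apiserver_requests:rate5m',
        expr: 'sum by(job, verb, code, instance) (rate(apiserver_request_count[5m]))',
      }],
    }],
  }),
}
```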
Since we started this project, many other useful Prometheus rules have been created by the community. Check srecon17_americas_slides_wilkinson.pdf for more information on this. If we had to start from scratch again, we'd likely be using the kubernetes-mixin together with jsonnet-bundler.