Implementing SLOs using Prometheus and Grafana

Online services should aim to provide an availability that matches business requirements. Defining that target should involve different teams across the organization, for example, from business development to engineering.

To verify that a service complies with these targets, you need measurable thresholds, for example, "the service must be available 99.9% of the time", which should in turn match users' expectations and business continuity needs.

SLAs, SLOs, SLIs word soup

There's a lot already written about these topics.

If you are not familiar with these terms, I would strongly recommend reading the article from Google's SRE book on Service Level Objectives first.

In summary:

  • SLAs: Service Level Agreement

    • What service you commit to provide to users, with possible penalties if you are not able to meet it.
    • Example: "99.5%" availability.
    • Keyword: contract
  • SLOs: Service Level Objective

    • What you have internally set as a target, driving your measurement thresholds (for example, for dashboards and alerting). In general, it should be stricter than your SLA.
    • Example: "99.9%" availability (the so-called "three 9s").
    • Keyword: thresholds
  • SLIs: Service Level Indicators

    • What you actually measure, to ascertain whether your SLOs are on/off-target.
    • Example: error ratios, latency
    • Keyword: metrics

SLOs are about time

So what does 99% availability mean? It's not a 1% error ratio (percentage of failed HTTP responses), but rather the percentage of time, over a predefined period, that the service has been available.

SLO Grafana dashboard screenshot

In the dashboard above, the service went above a 0.1% error ratio (0.001 on the y-axis) for 1 hour (the small red horizontal segment on top of the error spike), thus giving 99.4% availability over a 7-day period:

SLO formula example
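As a quick check of the arithmetic behind that figure, being out of SLO for 1 hour within a 7-day (168-hour) window gives:

    \text{availability} = \frac{T_{\text{total}} - T_{\text{out of SLO}}}{T_{\text{total}}} = \frac{168\,\text{h} - 1\,\text{h}}{168\,\text{h}} \approx 0.994 = 99.4\%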

A key factor in this result is the time span over which you choose to measure availability (7 days in the above example). Shorter periods are typically used as checkpoints for the engineering teams involved (for example, SRE and SWE) to track how the service is doing, while longer periods are usually used for review purposes by the organization or wider team.

For example, if you set a 99.9% SLO, then the total time the service can be down is:

  • during 30 days: ~43 min (about 3/4 of an hour)
  • during 90 days: ~130 min (~2 hours)
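Both figures follow from the same downtime-budget arithmetic, spelled out here as a quick check:

    T_{\text{down,max}} = (1 - \text{SLO}) \times T_{\text{period}}

    (1 - 0.999) \times 30 \times 24 \times 60\,\text{min} = 43.2\,\text{min}
    (1 - 0.999) \times 90 \times 24 \times 60\,\text{min} = 129.6\,\text{min}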

Another trivial "numbers fact" is that each extra 9 added to the SLO cuts the allowed downtime by a factor of ten. See the following allowed downtimes for a total 1-year period:

  • 2×9s: 99%: ~5256 min (~87.6 hours, or ~3.65 days)
  • 3×9s: 99.9%: ~526 min (~8.8 hours)
  • 4×9s: 99.99%: ~53 min
  • 5×9s: 99.999%: ~5 min <- handy rule of thumb: five 9s -> about 5 minutes (per year)

Enter error budgets

The above numbers for the allowed time a service can be down can be thought of as an error budget, which is consumed by events such as the following:

  • planned maintenance
  • failed upgrades
  • unexpected outages

The practical outcome is that any of the above consumes your service's error budget; for example, an unexpected outage may deplete it to the point of blocking further planned maintenance during that time period.

SLIs are about metrics

From the above, it's clear that we need service metrics that tell us when the service is considered (un)available. There are several common approaches to choosing these, such as the USE method, the RED method, and the "four golden signals" described in Google's SRE book.

Example SLO implementation

Let's work through a specific example, following the RED method (as the metrics we already have available are a better match for this approach): creating alerts and dashboards to support a target SLO for the Kubernetes API, using tools commonly deployed for monitoring purposes: [Prometheus] and [Grafana].

Additionally, we'll use [jsonnet] to build our rules and dashboard files, taking advantage of existing library helpers.

Rather than explaining how to signal when your service is outside its thresholds, this article focuses on how to record the time the service has been in that condition, as discussed in the "SLOs are about time" section above.

Concretely, the rest of the article focuses on creating Prometheus rules to capture "time out of SLO", based on thresholds for specific metrics (SLIs).

Define the SLO target and metrics thresholds

Let's define a simple target:

  • SLO: 99% availability, derived from the following:
  • SLIs:
    • error ratio under 1%
    • latency under 200ms for 90th percentile of requests

Writing the above spec as jsonnet (see [spec-kubeapi.jsonnet]):

slo:: {
  target: 0.99,                 // SLO: 99% availability
  error_ratio_threshold: 0.01,  // SLI: error ratio below 1%
  latency_percentile: 90,       // SLI: latency measured at the 90th percentile ...
  latency_threshold: 200,       // ... must stay below 200ms
},

Finding the SLIs

The Kubernetes API exposes several metrics we can use as SLIs, using the Prometheus rate() function over a short window (we choose 5min here; this window should be a few times your scrape interval):

  • apiserver_request_count: counts all the requests by verb, code and resource; for example, to get the total error ratio over the last 5 minutes:

    sum(rate(apiserver_request_count{code=~"5.."}[5m]))
     /
    sum(rate(apiserver_request_count[5m]))
    
  • The formula above discards all metric labels (for example, HTTP verb and code). If you want to keep some labels, you'd need something like the following:

    sum by (verb, code) (rate(apiserver_request_count{code=~"5.."}[5m]))
      / ignoring (verb, code) group_left
    sum (rate(apiserver_request_count[5m]))
    
  • apiserver_request_latencies_bucket: latency histogram by verb. For example, to get the 90th percentile latency in milliseconds (the metric is reported in microseconds, hence the division by 1e3; note also that the le "less than or equal" label is special, as it defines the histogram bucket boundaries, see [Prometheus histograms and summaries][promql-histogram]):

    histogram_quantile (
      0.90,
      sum by (le, verb, instance)(
        rate(apiserver_request_latencies_bucket[5m])
      )
    ) / 1e3
    

You can learn more about these queries in the Prometheus documentation on querying basics and functions.

Writing Prometheus rules to record the chosen SLIs

PromQL is a very powerful language, although as of October 2018 it doesn't yet support nested subqueries over ranges (see Prometheus issue 1227 for details), a feature we'd need in order to compute the ratio of time that the error ratio or latency spends outside its threshold.

Also, as a good practice to lower query-time Prometheus resource usage, it is recommended to add recording rules that precompute expressions such as sum(rate(...)) anyway.

As an example of how to do this, the following set of recording rules, taken from our [bitnami-labs/kubernetes-grafana-dashboards] repository, captures the above time ratio:

  • Create a new kubernetes:job_verb_code_instance:apiserver_requests:rate5m metric to record request rates:

    record: kubernetes:job_verb_code_instance:apiserver_requests:rate5m
    expr: |
      sum by(job, verb, code, instance) (rate(apiserver_request_count[5m]))
    
  • Using the above metric, create a new kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m for the request ratios (over the total):

    record: kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
    expr: |
      kubernetes:job_verb_code_instance:apiserver_requests:rate5m
        / ignoring(verb, code) group_left()
      sum by(job, instance) (
        kubernetes:job_verb_code_instance:apiserver_requests:rate5m
      )
    
  • Using the above ratio metrics (for every HTTP code and verb), create a new one to capture the error ratio:

    record: kubernetes:job:apiserver_request_errors:ratio_rate5m
    expr: |
      sum by(job) (
        kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
          {code=~"5..",verb=~"GET|POST|DELETE|PATCH"}
      )
    
  • Using the above error ratio (and a similarly created kubernetes::job:apiserver_latency:pctl90rate5m rule recording the 90th percentile latency over the past 5 minutes, not shown above for simplicity; a sketch is given after the alerting rules below), finally create a boolean metric to record our SLO compliance:

    record: kubernetes::job:slo_kube_api_ok
    expr: |
      (kubernetes:job:apiserver_request_errors:ratio_rate5m < bool 0.01)
        *
      (kubernetes::job:apiserver_latency:pctl90rate5m < bool 200)
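
The resulting kubernetes::job:slo_kube_api_ok series is 1 when both SLIs are within their thresholds and 0 otherwise, which is what lets us account for time: averaging this boolean over a window gives the fraction of time the service was within SLO. For example, as an illustrative ad-hoc query (not one of the recorded rules), availability over the last 7 days would be:

    avg_over_time(kubernetes::job:slo_kube_api_ok[7d])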
    

Writing Prometheus alerting rules

The final kubernetes::job:slo_kube_api_ok metric above is very useful for dashboards and for accounting for SLO compliance, but we should alert on whichever of the underlying metrics is driving the SLO off-target, as shown in the following Prometheus alert rules:

  • Alert on high API error ratio:

    alert: KubeAPIErrorRatioHigh
    expr: |
      sum by(instance) (
        kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
          {code=~"5..",verb=~"GET|POST|DELETE|PATCH"}
      ) > 0.01
    for: 5m
    
  • Alert on high API latency:

    alert: KubeAPILatencyHigh
    expr: |
      max by(instance) (
        kubernetes:job_verb_instance:apiserver_latency:pctl90rate5m
          {verb=~"GET|POST|DELETE|PATCH"}
      ) > 200
    for: 5m
    

Note that the Prometheus rules above are taken from the generated jsonnet output, which can be found in [our sources][bitnami-labs/kubernetes-grafana-dashboards], and that the thresholds are evaluated from $.slo.error_ratio_threshold and $.slo.latency_threshold respectively.
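For reference, the recorded 90th percentile latency metric used by the latency alert (mentioned earlier but not shown) can be derived from the histogram_quantile query in the SLI section above. A minimal sketch of what such a recording rule could look like follows; the exact rule in the repository may differ:

    record: kubernetes:job_verb_instance:apiserver_latency:pctl90rate5m
    expr: |
      histogram_quantile(
        0.90,
        sum by(le, job, verb, instance) (
          rate(apiserver_request_latencies_bucket[5m])
        )
      ) / 1e3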

Programmatically creating Grafana dashboards

Creating Grafana dashboards is usually done by interacting with the UI. This is fine for simple or "standard" dashboards (for example, ones downloaded from https://grafana.com/dashboards), but it becomes cumbersome if you want to follow DevOps best practices, especially in GitOps workflows.

The community is addressing this issue via efforts such as Grafana libraries for jsonnet, Python, and JavaScript. Given our jsonnet implementation, we chose grafonnet-lib.

One very useful outcome of using jsonnet to set our SLO thresholds and to code our Prometheus rules is that we can reuse these definitions to build our Grafana dashboards without copying and pasting them; that is, we keep a single source of truth.

For example:

  • referring to $.slo.error_ratio_threshold in our Grafana dashboards to set the graph panels' thresholds property, just as we did above for our Prometheus alert rules (see the sketch after this list).

  • referring to the recorded Prometheus rules via jsonnet, as in this excerpt from [spec-kubeapi.jsonnet]; note the use of metric.rules.requests_ratiorate_job_verb_code.record (instead of the verbatim 'kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m'):

    // Graph showing all requests ratios
    req_ratio: $.grafana.common {
      title: 'API requests ratios',
      formula: metric.rules.requests_ratiorate_job_verb_code.record,
      legend: '{{ verb }} - {{ code }}',
    },
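
To illustrate the single source of truth more concretely, here is a small standalone jsonnet sketch (with hypothetical field names, not an excerpt from our repository) showing the same $.slo.error_ratio_threshold value feeding both a Prometheus alert expression and a Grafana panel threshold:

// Hypothetical sketch: one shared spec object feeding both outputs.
{
  slo:: { error_ratio_threshold: 0.01 },

  // Prometheus alert expression built from the shared threshold.
  prometheus_alert: {
    alert: 'KubeAPIErrorRatioHigh',
    expr: 'kubernetes:job:apiserver_request_errors:ratio_rate5m > %g'
          % $.slo.error_ratio_threshold,
    'for': '5m',
  },

  // Grafana graph panel using the very same value for its threshold line.
  grafana_panel: {
    title: 'API error ratio',
    thresholds: [{ value: $.slo.error_ratio_threshold, op: 'gt', colorMode: 'critical' }],
  },
}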
    

You can read our implementation in dash-kubeapi.jsonnet; below is a screenshot of the resulting dashboard:

SLO Grafana dashboard screenshot

Putting it all together

We implemented above ideas in our bitnami-labs/kubernetes-grafana-dashboards repository, under the jsonnet folder.

Our final Prometheus rules and Grafana dashboard files are produced from the jsonnet sources as follows:

SLO jsonnet workflow

  • [spec-kubeapi.jsonnet]: as much data-only specification as possible (thresholds, rules and dashboard formulas)

Since we started this project, many other useful Prometheus rules have been created by the community. Check srecon17_americas_slides_wilkinson.pdf for more information on this. If we had to start from scratch again, we'd likely be using the kubernetes-mixin together with jsonnet-bundler.

Do you like what we do in Engineering at Bitnami and VMware?