Metrics are the key to providing an overall view of distributed applications. They are essential for exposing detailed information about your systems, building monitoring dashboards, and sending you alerts while you are on holiday. Prometheus is the cloud-native world’s de facto monitoring system, with its dimensional data model, flexible query language, efficient time-series database, and modern alerting approach. This article shows how to take Prometheus even further by integrating Thanos for high availability and long-term storage.

Let’s start with a recap of Prometheus, which steals the fire and brings more alerts and visibility to the cloud-native world.

Prometheus Recap

Prometheus is the leading instrumentation, collection, and storage tool; it originated at SoundCloud in 2012, was contributed to the Cloud Native Computing Foundation in 2016, and graduated from the foundation in 2018. Today, it is the most widely adopted monitoring tool for cloud-native applications. Prometheus’s popularity grew along with Kubernetes, since the two work well together to orchestrate and instrument containerized applications.

Prometheus can be separated into three major parts: the server, the web UI, and the Alertmanager. The server regularly scrapes the configured targets to collect metrics, evaluates rule expressions, and stores the results, which you can explore from the built-in web UI or via other visualization tools such as Grafana. (Our blog Prometheus and Grafana: The Perfect Combo has more info on the topic.) Finally, the Alertmanager component receives the alerts fired by the server’s alerting rules and groups, de-duplicates, and routes them to receivers such as PagerDuty, Opsgenie, or Slack.
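As a small illustration of how rules feed the Alertmanager, here is a minimal sketch of an alerting rule file; the file name, rule name, and threshold are illustrative assumptions rather than part of this article’s setup:

# Sketch: an example alerting rule file, referenced from prometheus.yml via
#   rule_files: ['example-rules.yml']
cat > example-rules.yml <<'EOF'
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0        # fires when a scraped target stops responding
        for: 5m              # the condition must hold for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF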

Now, let’s see what happens when you use Prometheus over the long term.

Global Visibility in Prometheus

You installed your first production cluster, packed it with Prometheus, and created glamorous dashboards in Grafana. Then the second cluster arrived. You installed Prometheus and Grafana again and kept two browser tabs open side by side, but it was bearable since you were scaling up. As the number of clusters grew, you found yourself sharing links to Prometheus and Grafana instances daily. At some point, you realized you were spending more time on the monitoring systems than on the actual applications running in the clusters.

In addition, analysts now want to cross-check application usage across clusters, and developers want an overall view of the running instances. Finally, management wants all of this information in a single dashboard.

It is possible to federate data from Prometheus instances into a single one; however, Prometheus is not designed as a long-term metrics storage system. You will likely face out-of-memory (OOM) kills while running heavy queries in the web UI, and configuring metrics sharding between Prometheus instances is not straightforward. Meanwhile, the complexity of your monitoring stack keeps growing and becomes hard to manage, and you find yourself allocating more CPU and memory to the Prometheus instances than to the actual applications. So, the best solution is to extend Prometheus’s storage capacity to create a global view of the world.
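For reference, federation usually means pointing an aggregating Prometheus at the /federate endpoint of the others. A rough sketch of such a scrape job, written as a shell heredoc; the file name, target address, and match[] selector are assumptions for illustration:

# Sketch: a federation scrape job that an aggregating Prometheus could merge
# into its own scrape_configs section
cat > federate-job-example.yml <<'EOF'
scrape_configs:
  - job_name: 'federate'
    honor_labels: true             # keep the original labels of federated series
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'     # only pull series matching this selector
    static_configs:
      - targets: ['127.0.0.1:9091']
EOF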

The scalability and durability of Prometheus’s local storage are limited by its single-node design. However, Prometheus is extensible and can integrate with remote storage systems. In the next section, we’ll dive into some long-term storage solutions for Prometheus.
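As a rough sketch, remote-write integration is a short addition to the Prometheus configuration; the file name and endpoint URL below are placeholder assumptions, not a recommendation for a specific backend:

# Sketch: a remote_write section that could be merged into prometheus.yml
cat > remote-write-example.yml <<'EOF'
remote_write:
  - url: "http://remote-storage.example.com/api/v1/write"   # placeholder endpoint
    queue_config:
      max_samples_per_send: 500    # batch size per request
EOF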

Long-Term Storage (LTS) Solutions

Long-term storage solutions for Prometheus are relatively new and still under active development. The prominent solutions in the market are Thanos by Improbable, Cortex by Weaveworks, and VictoriaMetrics. A comparison of these tools can be summarized as follows:

Table: VictoriaMetrics, Cortex, and Thanos comparison

As the table shows, Thanos is the most appropriate candidate to extend your Prometheus setup with long-term storage. It gives you a global query view with high availability and backup options. Let’s quickly check how Thanos achieves these goals with the following components:

  • Sidecar runs next to each Prometheus instance (in the same pod or on the same node), exposes its data for querying, and uploads it to cloud storage.
  • Store Gateway serves the metrics stored in a cloud storage bucket such as S3 or Google Cloud Storage (GCS).
  • Querier implements the Prometheus query API to run global queries across multiple Prometheus instances and long-term object storage.
  • Ruler evaluates Prometheus recording and alerting rules over the data exposed by the Querier.
  • Compactor gradually merges, compacts, and downsamples blocks of data in cloud storage so they can be stored and queried efficiently.
  • Receiver accepts data pushed from Prometheus via its remote_write API.

The components and their interaction can be seen in the diagram below:

Thanos architecture

Fig 1: Thanos architecture (Source: Thanos)
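The hands-on section below focuses on the Sidecar and Querier. For completeness, here is a minimal sketch of how the Store Gateway and Compactor could be started against an object storage bucket; the bucket.yml file, its placeholder credentials, and the container names are assumptions for illustration:

# Sketch: object storage configuration shared by the Store Gateway and Compactor
# (bucket name, endpoint, and credentials are placeholders)
cat > bucket.yml <<'EOF'
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.eu-west-1.amazonaws.com"
  access_key: "<ACCESS_KEY>"
  secret_key: "<SECRET_KEY>"
EOF

# Store Gateway: serves the blocks in the bucket to the Querier
docker run -d --rm --name thanos-store \
-v $(pwd)/bucket.yml:/bucket.yml \
quay.io/thanos/thanos:v0.7.0 \
store \
--data-dir /tmp/thanos-store \
--objstore.config-file /bucket.yml

# Compactor: compacts and downsamples the blocks in the bucket
docker run -d --rm --name thanos-compact \
-v $(pwd)/bucket.yml:/bucket.yml \
quay.io/thanos/thanos:v0.7.0 \
compact \
--data-dir /tmp/thanos-compact \
--objstore.config-file /bucket.yml \
--wait

In a full setup, the sidecars would also be given the same object storage configuration so they can upload blocks, and the Querier would be pointed at the Store Gateway’s gRPC endpoint with an additional --store flag.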

Now it’s time to get your hands dirty and install Prometheus with Thanos as a long-term storage option.

Thanos and Prometheus in Action

We’ll start with a vanilla Prometheus setup and convert it into a Thanos-enabled deployment. The final setup will give you two significant outcomes:

  • Reliable querying over multiple Prometheus instances from a single endpoint 
  • Seamless integration of highly available Prometheus instances

First, create two Prometheus instances in two fictional regions: eu-gryffindor and ap-ravenclaw.

Then, create two files named prometheus-eu-gryffindor.yml and prometheus-ap-ravenclaw.yml with the following content:

prometheus-eu-gryffindor.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: eu-gryffindor
    replica: 0
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9091']

prometheus-ap-ravenclaw.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ap-ravenclaw
    replica: 0
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9092']

These are plain Prometheus configuration files; each instance scrapes its own metrics endpoint, as defined in the scrape_configs section. Start the two Prometheus instances with the following commands:

docker run -d -p 0.0.0.0:9091:9091 --rm \
-v $(pwd)/prometheus-eu-gryffindor.yml:/etc/prometheus/prometheus.yml \
-u root \
--name prometheus-eu-gryffindor \
quay.io/thanos/prometheus:v2.12.0-rc.0-rr-streaming \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9091 \
--web.enable-lifecycle \
--web.enable-admin-api && echo "Prometheus EU-Gryffindor started"

docker run -d -p 0.0.0.0:9092:9092 --rm \
-v $(pwd)/prometheus-ap-ravenclaw.yml:/etc/prometheus/prometheus.yml \
-u root \
--name prometheus-ap-ravenclaw \
quay.io/thanos/prometheus:v2.12.0-rc.0-rr-streaming \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.listen-address=:9092 \
--web.enable-lifecycle \
--web.enable-admin-api && echo "Prometheus AP-Ravenclaw started"

Now, you should have two containers running with the names prometheus-eu-gryffindor and prometheus-ap-ravenclaw:

$ docker ps --format '{{.Names}}'
prometheus-ap-ravenclaw
prometheus-eu-gryffindor

Open http://localhost:9091 and http://localhost:9092 in the browser and check the number of series collected by running the prometheus_tsdb_head_series query on both hosts.

Prometheus instances and collected metrics

Fig 2: Prometheus instances and number of collected metrics
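If you prefer the command line, the same check can be done against the HTTP API of both instances; a quick sketch using curl (jq is optional and only used here to trim the output):

# Sketch: query the series count over the Prometheus HTTP API of both instances
curl -s 'http://localhost:9091/api/v1/query?query=prometheus_tsdb_head_series' | jq '.data.result'
curl -s 'http://localhost:9092/api/v1/query?query=prometheus_tsdb_head_series' | jq '.data.result'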

Now you have two Prometheus instances collecting metrics, but they have no knowledge of each other. It’s time to deploy the Thanos sidecars to create an interconnection layer between the Prometheus instances.

Create two sidecar containers to connect to each Prometheus instance with the following commands:

docker run -d -p 0.0.0.0:19091:19091 -p 0.0.0.0:19191:19191 --rm \
-v $(pwd)/prometheus-eu-gryffindor.yml:/etc/prometheus/prometheus.yml \
--link prometheus-eu-gryffindor:prometheus \
--name prometheus-sidecar-eu-gryffindor \
-u root \
quay.io/thanos/thanos:v0.7.0 \
sidecar \
--http-address 0.0.0.0:19091 \
--grpc-address 0.0.0.0:19191 \
--reloader.config-file /etc/prometheus/prometheus.yml \
--prometheus.url http://prometheus:9091 && echo "Started sidecar for Prometheus EU-Gryffindor"

docker run -d -p 0.0.0.0:19092:19092 -p 0.0.0.0:19192:19192 --rm \
-v $(pwd)/prometheus-ap-ravenclaw.yml:/etc/prometheus/prometheus.yml \
--link prometheus-ap-ravenclaw:prometheus \
--name prometheus-sidecar-ap-ravenclaw \
-u root \
quay.io/thanos/thanos:v0.7.0 \
sidecar \
--http-address 0.0.0.0:19092 \
--grpc-address 0.0.0.0:19192 \
--reloader.config-file /etc/prometheus/prometheus.yml \
--prometheus.url http://prometheus:9092 && echo "Started sidecar for Prometheus AP-Ravenclaw"

Now, you have installed your first Thanos component: the Sidecar. The sidecars will be your gateway for collecting metrics from the Prometheus instances and creating a global view.
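Before moving on, you can verify that both sidecars are up; a quick sketch (each Thanos component exposes its own metrics on the HTTP port configured above):

# Sketch: confirm the sidecar containers are running and responding
docker ps --format '{{.Names}}' | grep sidecar
curl -s http://localhost:19091/metrics | head -n 3
curl -s http://localhost:19092/metrics | head -n 3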

Next, connect these sidecars with the Querier component:

docker run -d -p 0.0.0.0:29090:29090 --rm \
--name querier \
--link prometheus-sidecar-eu-gryffindor:prometheus-sidecar-eu-gryffindor \
--link prometheus-sidecar-ap-ravenclaw:prometheus-sidecar-ap-ravenclaw \
quay.io/thanos/thanos:v0.7.0 \
query \
--http-address 0.0.0.0:29090 \
--query.replica-label replica \
--store prometheus-sidecar-eu-gryffindor:19191 \
--store prometheus-sidecar-ap-ravenclaw:19192 && echo "Started Querier"

Now you can open http://localhost:29090 in the browser and access the Thanos Querier UI (which also integrates seamlessly with Grafana). It is very similar to the Prometheus web UI, except that you now have a global view of your clusters.
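Because the Querier speaks the Prometheus query API, pointing Grafana at it only requires adding it as a regular Prometheus data source; a minimal provisioning sketch, where the file name and data source name are assumptions:

# Sketch: a Grafana data source provisioning file pointing at the Thanos Querier
cat > thanos-datasource.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus        # the Querier is API-compatible with Prometheus
    access: proxy
    url: http://localhost:29090
EOF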

Click Stores to list the available data sources:

Thanos Sidecar Stores

Fig 3: Thanos Stores

Two sidecars from the two regions are up and providing metrics. Let’s check the same prometheus_tsdb_head_series metric in Thanos now:

Global Query View in Thanos

Fig 4: Global Query View in Thanos

As expected, you can now check metrics from both clusters. Without any changes to your actual Prometheus instances, you have installed Thanos to collect metrics and create a global view with high availability. This smooth integration experience is one of the fundamental characteristics of Thanos.

If you need high availability for your Prometheus servers, you can run multiple replica instances that scrape the same targets. Thanos can then de-duplicate these metrics transparently at query time; you can read more about this in the Thanos docs.
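As a minimal sketch, a second replica could reuse the eu-gryffindor configuration and differ only in the replica external label and the listening port; the file name, port, and label value are assumptions. Since the Querier above was started with --query.replica-label replica, it would de-duplicate the series coming from both replicas:

# Sketch: configuration for a second replica of the eu-gryffindor Prometheus
cat > prometheus-eu-gryffindor-replica1.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: eu-gryffindor
    replica: 1               # only this label (and the port) differ from replica 0
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9093']
EOF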

Summary

In this article, we discussed the major obstacles to, and solutions for, highly available, multi-cluster monitoring. We then walked through how to add Thanos to your existing Prometheus instances. As illustrated, you can scale horizontally with the sidecar components without losing the global query view, thanks to the Querier.