Instrument Your Infrastructure With Prometheus

Here’s how we use the Prometheus time-series database for insight into our systems.

Data not only powers the decisions behind Man Group’s financial research and systematic trading models, but also informs our approach to software engineering and operations.

Motivating measurement

Let’s take the very general case of an application that has suffered some misfortune. It may be running sluggishly. Perhaps interactive performance is lagging, request latency is up, or it’s not consuming messages from a queue fast enough. Or it may have simply stopped: it could be entirely unresponsive, or have exited ungracefully.

Why has this happened? Logs may help, but it’s not always clear where to begin looking. An application can have dependencies on other services, and it’s always going to depend on the infrastructure on which it is running: CPU, memory, I/O and network.

If the problem is caused by pressure on a resource which is shared with other processes (e.g. memory or I/O), the bottleneck may present gradually or intermittently and then grow in severity, scale and impact as the resource becomes more contested. On the other hand, there are situations where an application shows no misbehaviour before it ceases to work entirely: for example, when it runs out of disk space.

The number of potential problems which can beset a service is large, but if we measure and collate the data about all of our dependencies up front, we can make troubleshooting simpler, and prevent many classes of problems before they manifest.

Prometheus: no half measures

Prometheus is an open source monitoring system and time series database that collects and aggregates observations about systems. We can satisfy our curiosity about the recent behaviour of a system by querying the metrics that Prometheus has collected.

To follow trends in real time, visualise our data and bring together related systems, we use Grafana to plot the metrics on a dashboard. Furthermore, Prometheus’ Alertmanager triggers alerts on unusual conditions and the symptoms of problems, so we get notified when something is awry.

Prometheus gets its metrics by pulling them in from HTTP endpoints. If you’ve got the source code for an application, you can integrate the Prometheus client libraries directly to expose metrics on the behaviour of your service.

But what about hardware? Even if an appliance doesn’t support Prometheus natively, as long as it exposes the data through an API, we can write a small web service to query that API and serve up those metrics.
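As a rough illustration of such a web service, here is a minimal sketch using only the Python standard library. It is not how our exporter is written. The appliance payload, the appliance_ metric prefix and port 9101 are all hypothetical, and a real exporter would normally use one of the official Prometheus client libraries rather than formatting the text by hand.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_appliance_stats():
    """Stand-in for a call to the appliance's API (hypothetical payload)."""
    # A real exporter would make an authenticated HTTP request here.
    return json.loads('{"capacity_bytes": 1000000, "used_bytes": 250000}')


def render_metrics(stats):
    """Render the stats as gauges in the Prometheus text exposition format."""
    lines = []
    for key, value in sorted(stats.items()):
        name = f"appliance_{key}"  # hypothetical naming scheme
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves fresh metrics on every GET /metrics, i.e. on every scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(fetch_appliance_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def build_server(port=9101):
    """Create (but do not start) the HTTP server; the port is an assumption."""
    return HTTPServer(("", port), MetricsHandler)
```

Calling build_server().serve_forever() would start serving. Because the appliance is queried inside the request handler, each scrape by Prometheus sees current values rather than a cached snapshot.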

Now with both software and hardware metrics in the same place, we have a unified interface for monitoring the aggregate behaviour at each layer: from what the trading and business software is doing, all the way down to the performance of infrastructure and appliances.

The reality is messier than this when you add logs, tracing and events into the picture - what’s sometimes called “observability”. And metrics are lossy - we don’t catch every event. But the power of Prometheus is to give us a platform to inform our intuition about our systems over time.

Prometheus has been immensely useful for giving us that first inkling when we need to know: “what’s happening?”

Instrument your infrastructure

Recently we’ve been working on gathering metrics for our storage systems. The Pure Storage FlashBlade is our storage appliance of choice for I/O intensive, parallel workloads such as Spark and our build infrastructure, and we want to ensure it’s in tippy top condition. Moreover, by looking at the historical data, we can identify trends and plan future I/O and storage capacity.

It’s worth pointing out that Telegraf has an input plugin which gathers metrics from SNMP agents, so if your appliance exposes data via SNMP, you can get it into Prometheus that way.

If you’re writing an exporter from scratch, you’ll need to parse the output from the API of your appliance and then expose each metric in one of the Prometheus metric types: Counter, Gauge, Histogram and Summary.

Most metrics fall into the Counter or Gauge category, but if you find yourself defining both a count of an event and a sum of its values so that you can get its average, a Summary may be a more appropriate choice. For some metrics, you’re most interested in the distribution. For example, having a slow average response time is a different problem than having a very slow 99th percentile. Summaries and Histograms help capture these subtleties.
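To make the difference between an average and a distribution concrete, here is a toy model of a histogram in plain Python. It is a sketch for illustration only: the class, the metric name request_latency_seconds and the bucket bounds are made up, and in practice the client library maintains these series for you. The one detail taken from Prometheus itself is that histogram buckets are cumulative and are exposed alongside a sum and a count.

```python
class Histogram:
    """Toy model of a Prometheus histogram: cumulative buckets, sum and count."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Buckets are cumulative: an observation counts towards every
        # bucket whose upper bound ("le") is at least the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def render(self):
        """Render the series in the text exposition format."""
        lines = [f"# TYPE {self.name} histogram"]
        for bound, n in zip(self.bounds, self.bucket_counts):
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {n}')
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {self.count}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.count}")
        return "\n".join(lines) + "\n"


# Hypothetical latencies: four fast requests and one very slow one.
latency = Histogram("request_latency_seconds", [0.1, 0.5, 1.0])
for seconds in [0.05, 0.07, 0.08, 0.09, 2.5]:
    latency.observe(seconds)

average = latency.total / latency.count           # sum / count
slow = latency.count - latency.bucket_counts[-1]  # observations above 1.0s
print(latency.render())
print(f"average={average:.3f}s, requests slower than 1s: {slow}")
```

With these made-up numbers the average is over half a second, which suggests a uniformly slow service, while the buckets show that four of the five requests finished within 0.1 seconds and a single outlier is responsible. A count and a sum alone cannot tell those two situations apart.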

You’ll also want to add labels: strings which describe and classify your timeseries. With the labels, the timeseries are tagged with the properties of your target (e.g. software version, datacenter location). Later, you can match on those labels in your Prometheus queries, by exact value or by regular expression, to pick out metrics from particular sources.
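The sketch below shows how labelled series look in the text exposition format, and imitates in plain Python what a regex label matcher does in a query. The metric name fs_used_bytes, the label values and both helper functions are hypothetical. The matching uses fullmatch because Prometheus anchors regex matchers at both ends, and a missing label is treated as an empty value.

```python
import re


def format_series(name, labels, value):
    """Render one sample with its labels in the text exposition format."""

    def escape(v):
        # Backslash, double quote and newline must be escaped in label values.
        return v.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    pairs = ",".join(f'{k}="{escape(v)}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


def select(series, label, pattern):
    """Keep series whose label value fully matches the regex."""
    regex = re.compile(pattern)
    return [s for s in series if regex.fullmatch(s["labels"].get(label, ""))]


# Hypothetical samples from three appliances.
series = [
    {"name": "fs_used_bytes", "labels": {"datacenter": "lon-1", "version": "2.3"}, "value": 42},
    {"name": "fs_used_bytes", "labels": {"datacenter": "lon-2", "version": "2.4"}, "value": 17},
    {"name": "fs_used_bytes", "labels": {"datacenter": "nyc-1", "version": "2.4"}, "value": 99},
]

# Roughly what the query fs_used_bytes{datacenter=~"lon-.*"} would select.
london = select(series, "datacenter", "lon-.*")
for s in london:
    print(format_series(s["name"], s["labels"], s["value"]))
```

Note that the pattern "lon" on its own would select nothing here, because the whole label value has to match.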

Once you’re done representing the appliance API responses as Prometheus metric types, start the server and check that all looks well at the HTTP endpoint. One nice aspect of Prometheus’ ‘pull’ architecture is that you can go to the metrics pages in a web browser and see that everything makes sense. Don’t forget to make sure that the endpoint is included in the list of targets of your Prometheus installation.
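Checking the page by eye can be backed up with a small script. The following is a sketch under simplifying assumptions: it handles only plain sample lines without timestamps, and the function names, the URL you pass in and the list of expected metric names are all up to you.

```python
from urllib.request import urlopen


def parse_samples(text):
    """Parse exposition-format text into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# HELP" / "# TYPE" comments
        # Assumes no trailing timestamp: the value is the last field.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def missing_metrics(text, expected):
    """Return the expected metric names that do not appear in the text."""
    names = {series.split("{", 1)[0] for series in parse_samples(text)}
    return sorted(set(expected) - names)


def check_endpoint(url, expected):
    """Fetch a metrics page and report which expected metrics are absent."""
    with urlopen(url, timeout=5) as response:
        return missing_metrics(response.read().decode("utf-8"), expected)


# Hypothetical page contents, as an exporter might serve them.
page = """# TYPE fs_used_bytes gauge
fs_used_bytes{datacenter="lon-1"} 42
# TYPE fs_reads_total counter
fs_reads_total{datacenter="lon-1"} 1200
"""
print(parse_samples(page))
print(missing_metrics(page, ["fs_used_bytes", "fs_writes_total"]))
```

Something like check_endpoint("http://localhost:9101/metrics", [...]) could then run against a live exporter, with the host and port adjusted to wherever yours listens.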

Open sourcing the FlashBlade exporter

If you’re lucky enough to own a Pure Storage FlashBlade, and want to collect Prometheus metrics for it, our Prometheus FlashBlade exporter has been open sourced here: https://github.com/manahl/prometheus-flashblade-exporter.

You can build and deploy it as a static, standalone binary, or run it as a Docker container.

At Man Group, we love not only metrics, but also open source, and welcome contributions.