Metrics and Performance Monitoring with Prometheus
Monitoring/Metrics in general
What is Monitoring?
The term “service monitoring” means collecting, processing, aggregating, and displaying real-time quantitative data about a system.
Why is it important?
- Identify trends
- Compare with previous values
- Predict future values
- Investigate an incident
- Verify expectations
What should be monitored?
To analyze the data, you first need to extract metrics from your system, such as the memory usage of a particular application instance. We call this extraction instrumentation.
An APM tool should collect data from several layers. The more of them you cover, the more insight you’ll get into your system’s behavior.
- Service level
- Host level
- Instance (or process) level
The list below collects the most important things to watch while you maintain a Node.js application in production.
- Service Downtimes
- Error Rate: Because errors are user-facing and immediately affect your customers.
- Response time: Because latency directly affects your customers and your business.
- Throughput: Traffic volume gives you the context for increased error rates and latency.
- Saturation: It tells you how “full” your service is. If the CPU usage is 90%, can your system handle more traffic?
- Memory Usage: A steadily growing value can help you recognize a leak.
The difference between logs, traces, and metrics
- Logs: Should be actionable. Only log errors that can be acted upon.
- Traces: For investigating what happened in a single request, end to end.
- Metrics: For everything else.
If something is worth logging, then it’s worth collecting metrics for, but not vice versa.
Measure your tasks: how often they run, how long they take, their payload size, the number of tasks in flight, and so on.
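As a rough illustration of those four measurements, here is a minimal sketch. The helper `createTaskStats` and the field names are hypothetical, made up for this example, and it keeps everything in memory rather than reporting to any monitoring system.

```javascript
// Hypothetical helper: tracks how often tasks run, how long they take,
// their payload size, and how many are currently in flight.
function createTaskStats() {
  const stats = { count: 0, inFlight: 0, totalSeconds: 0, totalPayloadBytes: 0 };

  // Call start() when a task begins; call the returned function when it ends.
  stats.start = function start(payloadBytes = 0) {
    const startedAt = process.hrtime.bigint();
    stats.inFlight += 1;
    let finished = false;
    return function end() {
      if (finished) return; // guard against counting the same task twice
      finished = true;
      stats.inFlight -= 1;
      stats.count += 1;
      stats.totalPayloadBytes += payloadBytes;
      stats.totalSeconds += Number(process.hrtime.bigint() - startedAt) / 1e9;
    };
  };
  return stats;
}

const uploadStats = createTaskStats();
const endFirst = uploadStats.start(512);
const endSecond = uploadStats.start(2048);
endFirst(); // one task finished, one still running
console.log(uploadStats.inFlight, uploadStats.count, uploadStats.totalPayloadBytes); // 1 1 512
```

Returning an `end` function from `start` keeps the start time private to each task, so overlapping tasks cannot mix up their timings.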
Prometheus
Prometheus is an open-source solution for monitoring and alerting. It provides efficient data compression and fast querying for time series data.
The core concept of Prometheus is that it stores all data in a time series format.
A time series is a stream of immutable timestamped values that belong to the same metric and the same set of labels. The labels make the metrics multi-dimensional.
Data collection and metric types
Prometheus uses the HTTP pull model, which means that every application needs to expose a GET /metrics endpoint that can be periodically fetched by the Prometheus instance.
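To make the pull model more concrete, the sketch below plays the role of the scraper: it fetches a `/metrics` endpoint and parses the plain-text response into samples. This is only an illustration of the idea, not how Prometheus itself is implemented. The function names are hypothetical, the target URL in the usage comment is an assumption, and the parser assumes sample lines carry no trailing timestamp.

```javascript
// Parse the Prometheus text exposition format into { series, value } samples.
// Simplification (assumption): sample lines have no optional trailing timestamp.
function parseSamples(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#')) // skip HELP/TYPE comments
    .map((line) => {
      const splitAt = line.lastIndexOf(' '); // the value follows the last space
      return {
        series: line.slice(0, splitAt),
        value: Number(line.slice(splitAt + 1)),
      };
    });
}

// Fetch one target once. Requires the global fetch available in Node.js 18+.
async function scrape(targetUrl) {
  const response = await fetch(targetUrl);
  if (!response.ok) throw new Error(`scrape failed: HTTP ${response.status}`);
  return parseSamples(await response.text());
}

// Usage (hypothetical target), repeated on a fixed interval like a scraper would:
// setInterval(() => scrape('http://localhost:3000/metrics').then(console.log, console.error), 15000);
```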
Prometheus has four metric types:
- Counter: a cumulative metric that represents a single numerical value that only ever goes up. Typically used to derive rates such as requests/sec, errors/sec, or uploads/min.
- Gauge: represents a single numerical value that can arbitrarily go up and down. Examples: temperature, active users, speed.
- Histogram: samples observations and counts them in configurable buckets. Examples: durations, delays, payload size.
- Summary: similar to a histogram in that it samples observations, but it calculates configurable quantiles over a sliding time window.
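The semantics of the first three types can be sketched in a few lines. These classes are simplified stand-ins written for this example, not the API of any client library. The histogram uses cumulative buckets, meaning each bucket counts every observation less than or equal to its upper bound.

```javascript
// Counter: only ever goes up.
class Counter {
  constructor() { this.value = 0; }
  inc(amount = 1) {
    if (amount < 0) throw new Error('a counter cannot decrease');
    this.value += amount;
  }
}

// Gauge: can go up and down arbitrarily.
class Gauge {
  constructor() { this.value = 0; }
  set(value) { this.value = value; }
  inc(amount = 1) { this.value += amount; }
  dec(amount = 1) { this.value -= amount; }
}

// Histogram: counts observations in cumulative buckets.
class Histogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
    this.count = 0; // total observations (what a "+Inf" bucket would hold)
    this.sum = 0;   // sum of all observed values
  }
  observe(value) {
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.bucketCounts[i] += 1;
    });
    this.count += 1;
    this.sum += value;
  }
}

const requests = new Counter();
requests.inc();
requests.inc();

const activeUsers = new Gauge();
activeUsers.inc(3);
activeUsers.dec();

const durations = new Histogram([0.1, 0.5, 1]); // bucket bounds in seconds
[0.05, 0.3, 0.7, 2].forEach((seconds) => durations.observe(seconds));
console.log(requests.value, activeUsers.value, durations.bucketCounts, durations.count);
// 2 2 [ 1, 2, 3 ] 4
```

Because the buckets are cumulative, the 2-second observation lands in none of the three finite buckets and shows up only in the total count.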
Metrics dimensions
Prometheus supports multiple dimensions through labels.
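One way to picture this: every distinct combination of label values is its own time series under the same metric name. The sketch below shows that idea with a hypothetical `LabeledCounter` class and a made-up `http_requests_total` metric; the identifier it builds follows the `name{label="value"}` notation of the Prometheus text format.

```javascript
// Escape a label value for the text format: backslash, double quote, newline.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Build a series identifier such as: http_requests_total{method="GET",status="200"}
function seriesId(name, labels = {}) {
  const pairs = Object.keys(labels)
    .sort() // stable order, so the same label set always yields the same id
    .map((key) => `${key}="${escapeLabelValue(labels[key])}"`);
  return pairs.length > 0 ? `${name}{${pairs.join(',')}}` : name;
}

// Hypothetical counter that keeps one value per label combination.
class LabeledCounter {
  constructor(name) {
    this.name = name;
    this.series = new Map();
  }
  inc(labels, amount = 1) {
    const id = seriesId(this.name, labels);
    this.series.set(id, (this.series.get(id) || 0) + amount);
  }
}

const httpRequests = new LabeledCounter('http_requests_total');
httpRequests.inc({ method: 'GET', status: '200' });
httpRequests.inc({ status: '200', method: 'GET' }); // same series, different key order
httpRequests.inc({ method: 'POST', status: '500' });
console.log([...httpRequests.series.entries()]);
```

Since each new label value creates a new series, labels with many possible values (user IDs, for example) multiply the amount of stored data quickly.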
Pushgateway
Prometheus offers an alternative called the Pushgateway for monitoring components that cannot be scraped, either because they live behind a firewall or because they are short-lived jobs.
Before a job gets terminated, it can push metrics to this gateway, and Prometheus can scrape the metrics from this gateway later on.
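A push from a short-lived job could look roughly like the sketch below. The Pushgateway accepts metrics in the text exposition format over HTTP at `/metrics/job/<job name>`, optionally followed by pairs of grouping labels. The function names, the job name, and the address in the usage comment are assumptions for illustration, and the sketch assumes job names and label values contain no slashes.

```javascript
// Build the push URL: <base>/metrics/job/<job>[/<label>/<value>...]
// Assumption: job and label values contain no "/" characters.
function pushgatewayUrl(baseUrl, job, groupingLabels = {}) {
  const segments = ['metrics', 'job', encodeURIComponent(job)];
  for (const [name, value] of Object.entries(groupingLabels)) {
    segments.push(encodeURIComponent(name), encodeURIComponent(value));
  }
  return `${baseUrl.replace(/\/+$/, '')}/${segments.join('/')}`;
}

// Push a text-format body. PUT replaces all metrics previously pushed for this group.
// Requires the global fetch available in Node.js 18+.
async function pushMetrics(baseUrl, job, body, groupingLabels) {
  const response = await fetch(pushgatewayUrl(baseUrl, job, groupingLabels), {
    method: 'PUT',
    headers: { 'Content-Type': 'text/plain; version=0.0.4' },
    body, // must end with a newline
  });
  if (!response.ok) throw new Error(`push failed: HTTP ${response.status}`);
}

// Usage (hypothetical job and address), right before the job exits:
// await pushMetrics('http://localhost:9091', 'nightly_backup',
//   'backup_last_success_timestamp_seconds 1700000000\n', { instance: 'worker-1' });
```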
Monitoring an application
When we want to monitor our application with Prometheus, we need to solve the following challenges:
- Instrumentation: Safely instrumenting our code with minimal performance overhead
- Metrics exposition: Exposing our metrics for Prometheus with an HTTP endpoint
- Hosting Prometheus: Having a well-configured Prometheus running
- Extracting value: Writing queries that are statistically correct
- Visualizing: Building dashboards and visualizing our queries
- Alerting: Setting up efficient alerts
- Paging: Getting notified about alerts, with escalation policies applied
Node.js Metrics Exporter
To collect metrics from our Node.js application and expose them to Prometheus, we can use the prom-client npm library.
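The instrumentation has three parts:
- Init: create the metrics the application will report.
- After each response: record the observed values.
- Metrics endpoint: expose the collected metrics so that Prometheus can scrape them.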
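To show how the three steps (init, recording after each response, and a metrics endpoint) fit together, here is a dependency-free sketch built on Node's `http` module. It deliberately does not use prom-client, which provides ready-made metric classes and a registry for this job. The metric names and the in-memory bookkeeping are assumptions made for the example.

```javascript
const http = require('node:http');

// Init: in-memory state (hypothetical metric names, not prom-client objects).
const requestCounts = new Map(); // 'method="GET",status="200"' -> count
let requestSecondsTotal = 0;

// After each response: bump the matching counter and add the elapsed time.
function recordResponse(method, statusCode, seconds) {
  const labels = `method="${method}",status="${statusCode}"`;
  requestCounts.set(labels, (requestCounts.get(labels) || 0) + 1);
  requestSecondsTotal += seconds;
}

// Metrics endpoint: render the state in the Prometheus text exposition format.
function renderMetrics() {
  const lines = [
    '# HELP http_requests_total Total HTTP responses sent.',
    '# TYPE http_requests_total counter',
  ];
  for (const [labels, count] of requestCounts) {
    lines.push(`http_requests_total{${labels}} ${count}`);
  }
  lines.push(
    '# HELP http_request_seconds_total Total time spent handling requests.',
    '# TYPE http_request_seconds_total counter',
    `http_request_seconds_total ${requestSecondsTotal}`,
  );
  return lines.join('\n') + '\n';
}

const server = http.createServer((req, res) => {
  if (req.method === 'GET' && req.url === '/metrics') {
    res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' });
    res.end(renderMetrics());
    return; // scrapes themselves are not counted
  }
  const startedAt = process.hrtime.bigint();
  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - startedAt) / 1e9;
    recordResponse(req.method, res.statusCode, seconds);
  });
  res.end('hello\n');
});

// server.listen(3000); // then: curl localhost:3000/metrics
```

Recording in the `finish` event means the status code and the duration reflect the response that was actually sent. A production setup would more likely record durations in a histogram than in a single running total.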
Queries
Prometheus provides a functional expression language that lets the user select and aggregate time series data in real time.
The Prometheus dashboard has a built-in query and visualization tool.
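The same queries can be run programmatically through the Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The sketch below builds such a request. The server address, the function names, and the `http_requests_total` metric in the example expression are assumptions, and the URL builder assumes Prometheus is served from the root path of its host.

```javascript
// Build the URL for an instant query.
// Assumption: Prometheus is served from the root path of prometheusBaseUrl.
function instantQueryUrl(prometheusBaseUrl, promql) {
  const url = new URL('/api/v1/query', prometheusBaseUrl);
  url.searchParams.set('query', promql); // takes care of URL encoding
  return url.toString();
}

// Run the query and return the result list. Requires global fetch (Node.js 18+).
async function runInstantQuery(prometheusBaseUrl, promql) {
  const response = await fetch(instantQueryUrl(prometheusBaseUrl, promql));
  const payload = await response.json();
  if (payload.status !== 'success') {
    throw new Error(`query failed: ${payload.error}`);
  }
  return payload.data.result;
}

// Example expression: per-second request rate over the last five minutes,
// summed per status label (hypothetical metric name).
const requestRateByStatus = 'sum by (status) (rate(http_requests_total[5m]))';

// Usage (hypothetical address):
// runInstantQuery('http://localhost:9090', requestRateByStatus).then(console.log, console.error);
```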
Alerting
Prometheus comes with a built-in alerting feature where you can use your queries to define your expectations. However, Prometheus alerting doesn’t come with a notification system. To set one up, you need to use the Alertmanager or another external process.
Kubernetes integration
Prometheus offers a built-in Kubernetes integration. It’s capable of discovering Kubernetes resources like Nodes, Services, and Pods while scraping metrics from them.
It’s an extremely powerful feature in a containerized system, where instances are born and die all the time. With a use case like this, HTTP endpoint-based scraping would be hard to achieve through manual configuration.
You can also provision Prometheus easily with Kubernetes and Helm.
Grafana
The built-in visualization of Prometheus is great for inspecting the output of our queries, but it’s not configurable enough to use for dashboards.
As Prometheus has an API to run queries and get data, you can use many external solutions to build dashboards. One of my favorites is Grafana.
Grafana is an open-source, pluggable visualization platform. It can process metrics from many types of systems, and it has built-in Prometheus data source support.