The Complete Guide to Monitoring on Kubernetes

At stack.io, we believe that the more intentional your monitoring, the better. Since storage is fairly cheap, it is worth monitoring everything you can get your hands on (with a purpose in mind). This will ideally give you ample metrics and logs to draw on when troubleshooting an issue.

Today, we’re going to talk about monitoring on Kubernetes: why you need it, how to perform it, and which metrics to pay attention to.

Why you need monitoring on Kubernetes

Confirms changes/upgrades

Monitoring allows us to confirm that the changes or upgrades we make are correct and compatible with the rest of the environment. A thorough monitoring setup will alert us if something has gone awry and (hopefully) prevent a disaster from occurring.

Provides a simple view of a complex system

Unlike a virtual machine, where you can run a few commands on the command line and get a clear view of its metrics, Kubernetes does not offer a single consolidated view natively. With monitoring in place, however, you can easily trace the system and make sure it is behaving as expected.

Monitoring on Kubernetes lets you easily observe the performance of your containers and deployments, track resource usage, and see when traffic peaks.

Notifies you of issues

Kubernetes by itself doesn’t alert you to issues. You can check components individually through the command line, but at the end of the day, it’s much better practice to automate this process. Effective monitoring should therefore feed custom alerts tailored to your needs. Not only are you more likely to catch errors early, it’s also a better way to keep every team member aware of what’s going on at all times.

In addition, a well-integrated monitoring and alerting system makes outages easier to diagnose, since you can quickly pinpoint what’s wrong or rule out components as the cause.
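
As a concrete illustration, here is a minimal Alertmanager configuration sketch that routes firing alerts to a Slack channel. The webhook URL and channel name are placeholders, and the exact file layout will depend on how you deploy Alertmanager (for example, via the kube-prometheus-stack Helm chart).

    # alertmanager.yml (minimal sketch; webhook URL and channel are placeholders)
    route:
      receiver: team-slack               # default receiver for every alert
      group_by: ['alertname', 'namespace']
      group_wait: 30s                    # wait briefly so related alerts are grouped
      repeat_interval: 4h                # re-notify while the alert keeps firing
    receivers:
      - name: team-slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
            channel: '#alerts'
            send_resolved: true          # also notify when the alert clears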

Saves money

Of course, monitoring isn’t only about avoiding issues or troubleshooting. It can also be applied proactively to improve your infrastructure, and one of the most common applications is cost optimization.

Although monitoring itself costs a small amount of money, the resulting resource optimization will likely more than offset that cost.

How to perform monitoring on Kubernetes

Prometheus + Grafana 

Your DevOps team can monitor directly with a cloud-provider-specific tool like CloudWatch on AWS. A lot of the required information comes natively with these services, but if you’re looking for a bit more customization, you might want to consider Prometheus and Grafana.

Put simply, Prometheus is an open-source monitoring system and Grafana is an open-source visualization platform. Prometheus can gather metrics from various sources, for example, RDS or Elasticsearch metrics via CloudWatch, or resource usage metrics scraped directly from your Kubernetes pods and nodes. Grafana can then read the Prometheus data and display the metrics in user-friendly views such as graphs and dashboards.
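
As a rough sketch, the scrape job below shows how Prometheus can discover and scrape Kubernetes pods on its own. It assumes Prometheus runs inside the cluster with read access to the Kubernetes API and that pods opt in via the common prometheus.io/scrape annotation; in practice, many teams let the kube-prometheus-stack Helm chart generate an equivalent configuration for them.

    # prometheus.yml scrape job (sketch): discover pods via the Kubernetes API
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod                    # ask the Kubernetes API for the list of pods
        relabel_configs:
          # keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # copy namespace and pod name onto every scraped metric
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod

Once these metrics land in Prometheus, Grafana only needs Prometheus added as a data source to start graphing them.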

Metrics to pay attention to

Response Time

Oftentimes, a high response time (more than a few milliseconds above your usual baseline) is an indicator of performance issues. You can configure your monitoring and alerting system to warn you if the average response time for a particular server exceeds the expected threshold.
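
For example, the Prometheus rule below (a sketch, not a drop-in) fires when the 95th-percentile response time stays above 500 ms for ten minutes. The http_request_duration_seconds histogram and the service label are assumptions about how your application is instrumented; substitute your own metric names and thresholds.

    # Prometheus alerting rule (sketch): p95 latency above 500ms for 10 minutes
    groups:
      - name: latency
        rules:
          - alert: HighResponseTime
            expr: |
              histogram_quantile(0.95,
                sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
              ) > 0.5
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "p95 latency above 500ms for {{ $labels.service }}"

Alerting on a percentile rather than the plain average helps catch slow tail requests that an average would hide.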

Resource Usage

Resource usage monitoring is usually more about upkeep and maintenance. You want to make sure that your environment can handle your web application and catch problems before they turn into incidents. For example, you may want to ask the following questions (a few of them are sketched as alert rules after this list):

  • Are pods running out of memory/being Out of Memory (OOM) killed?

  • Is the load average divided by the number of CPU cores on the machine greater than 1?

  • Is resource usage high on databases?

  • Are any databases or other data storage close to running out of storage space?

  • Are your SSL certificates close to expiry?

  • Are you reaching autoscaling limits (either for Kubernetes nodes or for the number of Kubernetes pods)?
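
A few of these questions translate directly into Prometheus alerting rules. The sketch below assumes kube-state-metrics and node-exporter are installed (both ship with the kube-prometheus-stack), since that is where these metric names come from.

    # Alerting rules (sketch) for a few of the questions above
    groups:
      - name: resource-usage
        rules:
          # A container restarted recently and its last termination was an OOM kill
          - alert: PodOOMKilled
            expr: |
              increase(kube_pod_container_status_restarts_total[15m]) > 0
                and on (namespace, pod, container)
              kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
            labels:
              severity: warning
          # Load average per CPU core has been above 1 for 15 minutes
          - alert: HighNodeLoad
            expr: |
              node_load1
                / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1
            for: 15m
            labels:
              severity: warning
          # A persistent volume is more than 85% full
          - alert: VolumeAlmostFull
            expr: |
              kubelet_volume_stats_available_bytes
                / kubelet_volume_stats_capacity_bytes < 0.15
            for: 10m
            labels:
              severity: critical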

Lack of Metrics

Sometimes the lack of metrics is itself a metric to pay attention to. An absence of metrics for a prolonged period could indicate that monitoring itself has failed. Similarly, if no logs have been ingested recently, your logging infrastructure may have stopped working.
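
In Prometheus terms, this usually means alerting on the monitoring pipeline itself, for example with the built-in up metric and the absent() function. The sketch below uses the kubernetes-pods job from earlier as an example; point it at a job you expect to always exist.

    # Alerting rules (sketch) for the monitoring pipeline itself
    groups:
      - name: meta-monitoring
        rules:
          # A known scrape target has stopped responding
          - alert: TargetDown
            expr: up == 0
            for: 5m
            labels:
              severity: warning
          # No series at all for a job that should always exist (job name is an example)
          - alert: MetricsAbsent
            expr: absent(up{job="kubernetes-pods"})
            for: 10m
            labels:
              severity: critical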

Conclusion

Monitoring intentional, well-chosen metrics will simplify debugging and troubleshooting in the face of a crisis. Especially in a complex system like Kubernetes, monitoring can provide a more digestible view of what’s going on internally.

What are some key metrics that your DevOps team pays attention to? Let us know.