When people hear ‘microservices’ they often think about Kubernetes, which is a declarative container orchestrator. Because of its declarative nature, Kubernetes treats each workload as its own entity (a pod) rather than as part of a service, which presents some challenges when it comes to troubleshooting. Let’s take a look at why troubleshooting microservices in a Kubernetes environment can be challenging, and at some best practices for getting it right.
To understand why troubleshooting microservices can be challenging, let’s look at an example. If you have an application in Kubernetes, you can deploy it as a pod and leverage Kubernetes to scale it; the entity you monitor is the pod. With microservices, you shouldn’t monitor pods; you should monitor services. A monolithic workload (a single container deployed as a pod) can be monitored on its own, but if a service is made up of several different pods, you need to understand the interactions between those pods to understand how the service is behaving. If you don’t, what looks like an event might not really be one (i.e., it might not be material to the functioning of the service).
When it comes to monitoring microservices, you need to monitor at the service level, not the pod level. If you try to monitor at the pod level, you’ll be fighting with the orchestrator and might get it wrong. I recognize that “You should not be monitoring pods” is a bold statement, but I believe that if you monitor pods directly, you won’t get it right the majority of the time.
Common sources of issues when troubleshooting microservices
Network, infrastructure, and application issues are all commonly seen when troubleshooting microservices.
Network
Issues at the network level are the hardest to debug. If the problem is in the network, you need to look at socket-layer stats. The underlying network has sockets connecting point A to point B, so you need to look at round-trip time at the network level, check whether packets are actually being transmitted, whether there’s a routing issue, and so on.
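To get a rough feel for what socket-layer inspection looks like, the sketch below shells out to the Linux ss utility and pulls per-connection round-trip times for a given source port. It assumes a Linux node with iproute2 installed and the usual `rtt:<mean>/<variance>` field in `ss -ti` output; in practice you would more likely collect these stats through a node-level agent.

```python
import re
import subprocess

def socket_rtts(port: int = 8080) -> list[float]:
    """Collect per-connection RTT estimates (in ms) from `ss -ti`.

    Assumes a Linux node with iproute2; `ss -ti` prints TCP info lines
    containing fields like `rtt:1.23/0.45` (mean/variance).
    """
    out = subprocess.run(
        ["ss", "-ti", "sport", "=", f":{port}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each "rtt:<mean>/<var>" field belongs to the connection printed above it.
    return [float(m.group(1)) for m in re.finditer(r"rtt:([\d.]+)/[\d.]+", out)]

if __name__ == "__main__":
    rtts = socket_rtts()
    if rtts:
        print(f"{len(rtts)} connections, worst RTT {max(rtts):.2f} ms")
    else:
        print("no matching connections on that port")
```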
Infrastructure
One way infrastructure issues can manifest is as pod restarts (crash looping in Kubernetes). This can happen for many reasons; for example, if a pod in your service can’t reach the Kubernetes data store, Kubernetes will restart it. You need to track the status of the pods backing the service, because frequent or repeated pod restarts are a sign of a problem.
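As a starting point, you can read restart counts for the pods behind a service straight from the Kubernetes API. This is a minimal sketch using the official Python client; the default namespace and the app=checkout label selector are placeholders for whatever selector your service actually uses.

```python
from kubernetes import client, config

def restart_counts(namespace: str, selector: str) -> dict[str, int]:
    """Return total container restarts per pod matching the selector."""
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return {
        pod.metadata.name: sum(
            cs.restart_count for cs in (pod.status.container_statuses or [])
        )
        for pod in pods.items
    }

if __name__ == "__main__":
    # "app=checkout" is a hypothetical selector for the pods backing a service.
    for name, restarts in restart_counts("default", "app=checkout").items():
        print(f"{name}: {restarts} restarts")
```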
Another common infrastructure issue is the Kubernetes API server being overloaded and taking a long time to respond. Every time something needs to happen, pods need to talk to the API server—so if it’s overloaded, it becomes an issue.
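A simple way to notice API server slowness is to time a lightweight call against it and alert when that call gets slow. The sketch below uses the Python client’s version endpoint purely as a cheap probe; the 500 ms threshold is an arbitrary placeholder.

```python
import time
from kubernetes import client, config

def apiserver_latency_ms() -> float:
    """Time a cheap read (the /version endpoint) against the API server."""
    config.load_kube_config()
    start = time.monotonic()
    client.VersionApi().get_code()  # returns the server's version info
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    latency = apiserver_latency_ms()
    print(f"API server responded in {latency:.0f} ms")
    if latency > 500:  # placeholder threshold
        print("warning: API server is responding slowly")
```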
A third infrastructure issue is related to the Domain Name System (DNS). In Kubernetes, your services are identified by names, which get resolved with a DNS server. If those resolutions are slow, you start to see issues.
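You can sanity-check this by timing how long name resolution takes from inside the cluster. The sketch below times socket.getaddrinfo for a cluster-internal name; my-service.default.svc.cluster.local is a hypothetical name following the usual Kubernetes DNS scheme.

```python
import socket
import time

def resolve_ms(hostname: str) -> float:
    """Return how long a DNS lookup of `hostname` takes, in milliseconds."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, None)
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    # Hypothetical in-cluster service name; adjust the name and namespace to your own.
    name = "my-service.default.svc.cluster.local"
    print(f"{name} resolved in {resolve_ms(name):.1f} ms")
```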
Application
There are several common application issues that can lead to restarts and errors. For example, if service load balancing isn’t working correctly, say because a URL changed or the load balancer is misconfigured, you could be overloading a single pod and causing it to restart.
If your URLs are not constructed properly, you’ll get a “404 Not Found” response; if the server is overloaded, you’ll get a 500 error. These are application issues that manifest as infrastructure issues.
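One way to tell these cases apart is to probe the service and tally the status codes it returns. The sketch below uses only the Python standard library; the service URL is a hypothetical in-cluster address.

```python
from collections import Counter
from urllib import error, request

def probe(url: str, attempts: int = 20) -> Counter:
    """Hit the URL repeatedly and count the HTTP status codes returned."""
    codes: Counter = Counter()
    for _ in range(attempts):
        try:
            with request.urlopen(url, timeout=2) as resp:
                codes[resp.status] += 1
        except error.HTTPError as e:   # 4xx/5xx responses land here
            codes[e.code] += 1
        except error.URLError:         # DNS or connection failures
            codes["unreachable"] += 1
    return codes

if __name__ == "__main__":
    # Hypothetical endpoint: many 404s suggest bad URLs, many 500s an overloaded server.
    print(probe("http://my-service.default.svc.cluster.local/api/orders"))
```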
Best practices for troubleshooting microservices
Here are two best practices for effectively identifying and troubleshooting microservice issues.
1. Aggregate data at the service level
You need to use a tool that provides data (i.e., logs) aggregated at the service level, so you can see how many pod restarts, error responses, and so on occurred for the service as a whole. This is different from the approach most DevOps engineers use today, where every pod restart is a separate alert, leaving engineers buried in alerts for events that may just be normal operation or Kubernetes correcting itself.
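To make that concrete, here is a minimal sketch of rolling pod-level signals up to the service level with the official Kubernetes Python client. The service names, the app= label selectors, and the default namespace are all placeholders, and a real tool would pull from much richer telemetry than restart counts and warning events.

```python
from kubernetes import client, config

# Hypothetical mapping of service name -> label selector for its backing pods.
SERVICES = {
    "checkout": "app=checkout",
    "payments": "app=payments",
}

def service_summary(namespace: str = "default") -> dict:
    """Roll pod-level restarts and warning events up to one summary per service."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    warning_events = v1.list_namespaced_event(
        namespace, field_selector="type=Warning"
    ).items

    summary = {}
    for svc, selector in SERVICES.items():
        pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
        pod_names = {p.metadata.name for p in pods}
        summary[svc] = {
            "pods": len(pods),
            "restarts": sum(
                cs.restart_count
                for p in pods
                for cs in (p.status.container_statuses or [])
            ),
            "warning_events": sum(
                1 for ev in warning_events if ev.involved_object.name in pod_names
            ),
        }
    return summary

if __name__ == "__main__":
    for svc, stats in service_summary().items():
        print(svc, stats)
```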
Some DevOps engineers might wonder whether a service mesh can be used to aggregate data in this way. While service meshes have observability tooling baked in, you need to be careful: because of the volume of data involved, many service meshes sample, and they hand you raw data plus labels and leave the aggregation to you. What you really need is a tool that gives you just the data you need for the service, along with service-level reporting.
2. Use machine learning
When trying to identify and troubleshoot microservice issues, you need to monitor how each pod belonging to your service is behaving. This means monitoring metrics like latency, number of process restarts, and network connection errors. There are two ways to do this:
Set a threshold — For example, if there are more than 20 errors, create an alert. This is a bit of a naive approach in a dynamic system like Kubernetes, particularly with microservices.
Baselining — Use machine learning to study how a metric behaves over time, and build a machine learning model to predict how that metric will behave in the future. If the metric deviates from its baseline, you will receive an alert specifying which parameters led the machine learning algorithm to believe there was an issue.
I advise against trying to set a threshold—you’ll be flooded with alerts and this will cause alert fatigue. Instead, use machine learning. Over time, a machine learning algorithm can start to alert you before an issue arises.
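To illustrate the baselining idea without a full machine learning pipeline, here is a deliberately simple stand-in: a rolling mean and standard deviation over a metric’s recent history, flagging values that drift several deviations from the baseline. The window size, the deviation multiplier, and the sample error counts are arbitrary; a real system would use a trained forecasting model fed by live metrics.

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Very small stand-in for a learned baseline: rolling mean +/- k * stddev."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.history: deque[float] = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        """Record a new sample; return True if it deviates from the baseline."""
        anomalous = False
        if len(self.history) >= 10:  # need some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        self.history.append(value)
        return anomalous

if __name__ == "__main__":
    # Hypothetical per-minute error counts for a service: steady traffic, then a spike.
    samples = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 25]
    baseline = Baseline(window=60, k=3.0)
    for minute, errors in enumerate(samples):
        if baseline.observe(errors):
            print(f"minute {minute}: {errors} errors deviates from baseline")
```

The point of the baselining approach is that the alert condition adapts to how the metric normally behaves for that service, rather than relying on a fixed number someone picked up front.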