Embracing Failure: Lessons learned from 50 post-mortems
July 20, 2021
4 min read
Henning Jacobs of Zalando fame has started compiling an awesome list of failure stories at k8s.af (and yes, failure stories are awesome!).
With 42 incidents and counting, it is starting to provide a good dataset for identifying common contributing factors to outages.
Note that we say contributing factors, not root cause: decades of resilience engineering research have found that the idea of a single, linear cause in complex adaptive systems is a widely held fallacy.
Contributing Factors
Raw data (skewed slightly by the large number of Zalando outages reported)
Resource Exhaustion
Resource exhaustion is the most common contributing factor in bringing down Kubernetes clusters.
Increase the number of fault domains to limit the amount of over-provisioning that is required
Use nodeAffinity and/or taints to ensure your node pools have enough capacity
Enforce ResourceQuotas to restrict both the total CPU/memory and the number of Pods and LoadBalancers that each namespace can create (see the sketch after this list).
Monitor CronJobs and/or isolate them into their own namespaces. If not configured correctly they can quickly run away and consume all available Pod resources.
When running in the cloud, be wary of API rate limiting on dependencies such as IAM and container registries
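As an example, a per-namespace ResourceQuota can cap both compute and object counts. The sketch below is only illustrative; the team-a namespace and the numbers are assumptions to adapt to your own workloads.

```yaml
# Sketch: per-namespace ResourceQuota capping CPU, memory, Pods and
# LoadBalancer Services. Namespace name and limits are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.memory: 64Gi
    pods: "100"
    services.loadbalancers: "2"
```

Apply it with kubectl apply -f quota.yaml and check consumption with kubectl describe quota -n team-a.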
Overprovisioning Nodes
If you don't specify any requests, then Kubernetes can schedule more Pods than a node has resources for, triggering OutOfMemory errors and/or CPU exhaustion. When this happens, the kubelet may also start reporting unhealthy, triggering Kubernetes to evict all of its pods and reschedule them on different nodes, causing a cascading failure that can bring down the entire cluster.
Always use CPU / Memory requests
Don't over-provision memory
Consider skipping CPU limits as they come with an overhead due to kernel-based throttling (see the example below)
This is why I always advise:
1) Always set memory limit == request 2) Never set CPU limit
(for locally adjusted values of "always" and "never")
— Tim Hockin (@thockin), May 30, 2019
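Putting that advice together, a container's resources block might look like the sketch below (the pod name, image, and sizes are placeholders, not recommendations):

```yaml
# Sketch: memory limit equals the request, CPU gets a request but no limit.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: nginx:1.21        # placeholder image
      resources:
        requests:
          cpu: 250m            # scheduler reserves this much CPU
          memory: 256Mi
        limits:
          memory: 256Mi        # limit == request avoids overcommitting memory
          # no cpu limit: avoids CFS throttling overhead
```

With the memory limit equal to the request, the node never promises more memory than it can deliver; leaving the CPU limit off lets the container use idle CPU without being throttled.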
DNS
DNS is the central component of service discovery, so DNS issues can affect the entire cluster
Be wary of the ndots:5 issue - it is easy to DDoS yourself (see the dnsConfig sketch after this list)
Random 2 – 5s latencies are often DNS related
Give CoreDNS extra CPU and Memory, and increase the number of replicas as needed
Consider NodeLocal DNSCache for DNS caching and to reduce conntrack usage
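If your workloads can use fully-qualified names (e.g. my-svc.my-namespace.svc.cluster.local), one mitigation for the ndots:5 issue is to lower ndots via a Pod-level dnsConfig. The sketch below is illustrative; the pod name and image are placeholders.

```yaml
# Sketch: lower ndots from the default of 5 so lookups for external names
# don't walk every search domain first.
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-app
spec:
  containers:
    - name: app
      image: nginx:1.21        # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```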
Upgrades
Outages that occur during or after upgrades feature regularly. In particular, time bombs can be introduced during an upgrade that only present themselves at the worst possible moment (e.g. restarting your infrastructure in an attempt to mitigate a partial outage triggers a full outage)
Be wary of long-running pods and services; consider implementing a policy of terminating pods after a set period (see the descheduler sketch below)
Test both new deployments and upgrades from previous versions
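One way to cap pod lifetimes is the kubernetes-sigs descheduler's PodLifeTime strategy. The sketch below uses its v1alpha1 policy format with an illustrative seven-day cutoff; treat the exact fields as an assumption to verify against the descheduler version you run.

```yaml
# Sketch: descheduler policy that evicts pods older than seven days,
# forcing them to be rescheduled regularly. Cutoff is illustrative.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 604800   # 7 days
```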