Embracing Failure: Lessons learned from 50 post-mortems
July 20, 2021
4 min read
Henning Jacobs of Zalando fame has started compiling an awesome list of failure stories at k8s.af (and yes, failure stories are awesome!).
With 42 incidents and counting, it is starting to provide a good dataset for identifying common contributing factors to outages.
Note that we say contributing factors, not root cause: decades of resilience engineering research have found that the idea of a single, linear cause in complex adaptive systems is a widely held fallacy.
Contributing Factors
Raw data (skewed slightly by the large number of Zalando outages reported)
Resource Exhaustion
Resource exhaustion is the most common contributing factor in bringing down Kubernetes clusters.
Increase the number of fault domains to limit the amount of over-provisioning that is required
Use nodeAffinity and/or taints to ensure your node pools have enough capacity
Enforce ResourceQuotas to restrict both the total CPU/memory and the number of Pods and LoadBalancers that each namespace can create (see the sketch after this list).
Monitor CronJobs and/or isolate them into their own namespaces. If not configured correctly they can quickly run away and consume all available Pod resources.
When running in the cloud, be wary of API rate limiting on dependencies such as IAM and container registries
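As an example, a per-namespace ResourceQuota can cap both compute and object counts. The sketch below is only illustrative; the team-a namespace and the numbers are assumptions to adapt to your own workloads.

```yaml
# Sketch: per-namespace ResourceQuota capping CPU, memory, Pods and
# LoadBalancer Services. Namespace name and limits are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.memory: 64Gi
    pods: "100"
    services.loadbalancers: "2"
```

Apply it with kubectl apply -f quota.yaml and check consumption with kubectl describe quota -n team-a.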
Overprovisioning Nodes
If you don't specify any requests, then Kubernetes can schedule more Pods than a node has resources for, triggering OutOfMemory errors and/or CPU exhaustion. When this happens, the kubelet may also start reporting unhealthy, triggering Kubernetes to evict all of its pods and reschedule them on different nodes, causing a cascading failure that can bring down the entire cluster.
Always use CPU / Memory requests
Don't over-provision memory
Consider skipping CPU limits as they come with an overhead due to kernel-based throttling (see the example below)
This is why I always advise:
1) Always set memory limit == request 2) Never set CPU limit
(for locally adjusted values of "always" and "never")
— Tim Hockin (@thockin), May 30, 2019
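Putting that advice together, a container's resources block might look like the sketch below (the pod name, image, and sizes are placeholders, not recommendations):

```yaml
# Sketch: memory limit equals the request, CPU gets a request but no limit.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: nginx:1.21        # placeholder image
      resources:
        requests:
          cpu: 250m            # scheduler reserves this much CPU
          memory: 256Mi
        limits:
          memory: 256Mi        # limit == request avoids overcommitting memory
          # no cpu limit: avoids CFS throttling overhead
```

With the memory limit equal to the request, the node never promises more memory than it can deliver; leaving the CPU limit off lets the container use idle CPU without being throttled.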
DNS
DNS is the central component of service discovery, so DNS issues can affect the entire cluster
Be wary of the ndots:5 issue - it is easy to DDoS yourself (see the dnsConfig sketch after this list)
Random 2 – 5s latencies are often DNS related
Give CoreDNS extra CPU and Memory, and increase the number of replicas as needed
Consider NodeLocal DNSCache for DNS caching and to reduce conntrack usage
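If your workloads can use fully-qualified names (e.g. my-svc.my-namespace.svc.cluster.local), one mitigation for the ndots:5 issue is to lower ndots via a Pod-level dnsConfig. The sketch below is illustrative; the pod name and image are placeholders.

```yaml
# Sketch: lower ndots from the default of 5 so lookups for external names
# don't walk every search domain first.
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-app
spec:
  containers:
    - name: app
      image: nginx:1.21        # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```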
Upgrades
Outages that occur during or after upgrades feature regularly. In particular, time bombs can be introduced during an upgrade that only present themselves at the worst possible moment (e.g. restarting your infrastructure in an attempt to mitigate a partial outage triggers a full outage)
Be wary of long-running pods and services; consider implementing a policy of terminating pods after a set period (see the descheduler sketch below)
Test both new deployments and upgrades from previous versions
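One way to cap pod lifetimes is the kubernetes-sigs descheduler's PodLifeTime strategy. The sketch below uses its v1alpha1 policy format with an illustrative seven-day cutoff; treat the exact fields as an assumption to verify against the descheduler version you run.

```yaml
# Sketch: descheduler policy that evicts pods older than seven days,
# forcing them to be rescheduled regularly. Cutoff is illustrative.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 604800   # 7 days
```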