Embracing Failure: Lessons learned from 50 post-mortems

July 20, 2021
4 min read

Henning Jacobs of Zalando fame has started compiling an awesome list of failure stories at k8s.af (and yes, failure stories are awesome!).

With 42 incidents and counting, it is starting to provide a good dataset for identifying common contributing factors to outages.

Note that we use the term contributing factors rather than root cause: decades of resilience engineering research has found that linear causality in complex adaptive systems is a widely held fallacy.

Contributing Factors


Raw data (skewed slightly by the large number of Zalando outages reported)

Resource Exhaustion

Resource exhaustion is the most common contributing factor in bringing down Kubernetes clusters.

  • Increase the number of fault domains to limit the amount of over-provisioning that is required
  • Use nodeAffinity and/or taints to ensure your node pools have enough capacity
  • Enforce ResourceQuotas to restrict both the total CPU/memory and the number of Pods and LoadBalancers that each namespace can create (see the quota sketch after this list)
  • Monitor CronJobs and/or isolate them into their own namespaces. If not configured correctly they can quickly run away and consume all available Pod resources.
  • When running in the cloud, be wary of API rate limiting on dependencies such as IAM and container registries
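
To make the ResourceQuota point above concrete, here is a minimal sketch of a per-namespace quota; the namespace name and all values are placeholders that should come from your own capacity planning:

```yaml
# Hypothetical quota for a single team namespace - adjust values to your capacity plan
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota              # placeholder name
  namespace: team-a             # placeholder namespace
spec:
  hard:
    requests.cpu: "20"          # total CPU requests allowed in the namespace
    requests.memory: 40Gi
    limits.memory: 60Gi
    pods: "100"                 # cap on the number of Pods
    services.loadbalancers: "2" # cap on LoadBalancer Services
```

A quota like this turns a runaway namespace into a local problem rather than a cluster-wide one.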

Overprovisioning Nodes

If you don't specify any requests then Kubernetes will potentially schedule more Pods than a node has resources for, triggering OutOfMemory errors and/or CPU exhaustion. When this happens, the kubelet can also start reporting unhealthy, prompting Kubernetes to evict all pods and reschedule them on different nodes, causing a cascading failure that can bring down the entire cluster.

  • Always use CPU / Memory requests
  • Don't over-provision memory
  • Consider skipping CPU limits as they come with an overhead due to kernel-based throttling

This is why I always advise:

1) Always set memory limit == request
2) Never set CPU limit

(for locally adjusted values of "always" and "never") — Tim Hockin (@thockin), May 30, 2019
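
A minimal sketch of that advice in a container spec - the pod name, image and values below are illustrative, not recommendations:

```yaml
# Illustrative resources block: memory limit equals request, no CPU limit
apiVersion: v1
kind: Pod
metadata:
  name: resource-example           # placeholder name
spec:
  containers:
    - name: app
      image: example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: 500m                # used by the scheduler for bin-packing
          memory: 512Mi
        limits:
          memory: 512Mi            # limit == request: no memory overcommit surprises
          # no CPU limit set: avoids CFS throttling overhead
```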

DNS

DNS is the central component of service discovery, so DNS issues can affect the entire cluster.

  • Be wary of the ndots:5 issue - it is easy to DDoS yourself (see the dnsConfig sketch after this list)
  • Random 2–5s latencies are often DNS-related
  • Give CoreDNS extra CPU and Memory, and increase the number of replicas as needed
  • Consider Node Local DNS for DNS caching and reducing conntrack usage
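
For the ndots issue mentioned above, one mitigation - assuming your workloads mostly resolve fully qualified external names - is to lower ndots via the pod's dnsConfig. A sketch with placeholder names:

```yaml
# Illustrative dnsConfig lowering ndots to cut spurious search-domain lookups
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-app              # placeholder name
spec:
  containers:
    - name: app
      image: example.com/app:1.0   # placeholder image
  dnsConfig:
    options:
      - name: ndots
        value: "2"                 # default is 5; lower values reduce search-domain expansion
```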

Upgrades

Outages that occur during or after upgrades feature regularly. In particular, time bombs can be introduced during an upgrade that only reveal themselves at the worst possible moment, e.g. restarting your infrastructure in an attempt to mitigate a partial outage and thereby triggering a full outage.

  • Be wary of long-running pods and services; consider implementing a policy of terminating pods after a set period
  • Test both new deployments and upgrades from previous versions

Networking/Ingress

  • Understand zero-downtime deployments (it is not as easy as you may think - see the rollout sketch after this list)
  • Monitor your network control plane, especially for resource exhaustion and capacity limits
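
One commonly used zero-downtime pattern combines a readiness probe, a small preStop delay (so endpoints are deregistered before the process receives SIGTERM) and a conservative rolling update strategy. This is only a sketch; the name, image, port, path and timings are placeholders:

```yaml
# Sketch of a rolling-update-friendly Deployment: readiness gate plus preStop drain delay
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                # bring new pods up before taking old ones down
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: web
          image: example.com/web:1.0   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz           # placeholder path
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # give load balancers time to drain connections
```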

Application

Incorrectly configured applications can quickly lead to issues that cascade into cluster-wide failure.

  • Use policies to limit developer-defined configuration to sensible values (see OPA)
  • Enforce the use of liveness and readiness probes (see the sketch after this list)
  • Restrict replica counts and constrain deployment strategy budgets (e.g. maxSurge / maxUnavailable)
  • Ensure CronJobs are correctly configured
  • Avoid custom scheduling using node taints
  • Apply default podAntiAffinity rules
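
As a sketch of the probe and podAntiAffinity points above (the name, image, paths and thresholds are all placeholders):

```yaml
# Illustrative Deployment with liveness/readiness probes and soft anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api
                topologyKey: kubernetes.io/hostname   # prefer spreading replicas across nodes
      containers:
        - name: api
          image: example.com/api:1.0   # placeholder image
          readinessProbe:
            httpGet:
              path: /ready             # placeholder path
              port: 8080
          livenessProbe:
            httpGet:
              path: /healthz           # placeholder path
              port: 8080
            initialDelaySeconds: 10
```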
