The Kubernetes On-Call Handbook
Production playbooks for engineers carrying the pager
Most Kubernetes books optimize for tutorials. This one optimizes for the 3 a.m. page. You will learn how to triage a wedged control plane, decode opaque CNI failures, untangle etcd performance regressions, and write incident-grade runbooks that other on-call engineers will actually use. Every chapter ends with a real, anonymized postmortem from a production environment running between 200 and 50,000 nodes.
Mariana spent the last decade running on-call for trading systems and global edge networks. Her work focuses on the boring failures nobody writes blog posts about: clock skew, partial partitions, and the contract between SREs and the language runtime.
- Pages: 412
- Edition: 2nd Edition
- Language: English
- Level: Advanced
- ISBN: 978-1-99999-001-2
- Published: March 2026
Three working engineers at peer publications reviewed this book before release. We do not publish first drafts.
What you'll find inside.
- 01 Why Kubernetes Pages You
- 02 The Anatomy of an etcd Failure
- 03 CNI Forensics: Calico, Cilium, Flannel
- 04 Control Plane Recovery Drills
- 05 Cluster Autoscaler Pathologies
- 06 Stateful Workloads You Cannot Drain
- 07 Observability for the Cluster, Not the Pod
- 08 Writing Runbooks Engineers Trust
- 09 Postmortems Without the Blame
- 10 Building an On-Call Culture That Sustains
5.0 / 5
187 verified readers
Made my team's on-call rotation calmer
Distributed the postmortem chapter to my whole team. It changed the temperature of our incident reviews within two weeks.
Worth the price on chapter 3 alone
I have read every K8s book published in the last six years. This one is the first that admits etcd will eventually be your problem and tells you exactly what to do about it. Chapter 3 paid for the book inside a week.