Vol. III · Issue 06 · May 2026 · ISSN 2814-9921
Machine Learning Systems · 1st Edition · March 2026

Serving LLMs in Production

GPUs, batching, and the latency budget you forgot about

5.0 (274 ratings)
Advanced
384 pages

Notebook demos do not survive contact with production traffic. This book covers GPU memory math, continuous batching, paged attention, KV cache eviction, speculative decoding, and the cost models that determine whether a feature ships green or red. Includes load test results across vLLM, TGI, and TensorRT-LLM at three traffic shapes.
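As a taste of the GPU memory math the blurb mentions, KV cache footprint can be sketched with a back-of-the-envelope calculation. This is an illustrative sketch, not taken from the book; the model figures (32 layers, 32 KV heads, head dim 128, fp16 weights on an 80 GiB GPU) are assumptions for the example:

```python
# Back-of-the-envelope KV cache sizing. All figures are illustrative
# assumptions, not numbers from the book.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(gpu_bytes: int, weight_bytes: int, per_token: int) -> int:
    """Tokens whose KV cache fits after weights (ignores activations/overhead)."""
    return (gpu_bytes - weight_bytes) // per_token

# Hypothetical 7B-class model in fp16: 0.5 MiB of KV cache per token.
per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

# 80 GiB GPU minus ~14 GiB of weights leaves room for ~135k cached tokens,
# shared across every concurrent request in the batch.
tokens = max_cached_tokens(80 * 2**30, 14 * 2**30, per_tok)
```

Numbers like these are why batching and KV cache eviction dominate serving economics: the cache, not the weights, is what limits concurrency.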

Author
Dr. Priya Anand · ML Platform Engineer

Priya runs the kind of ML platforms where a 200ms regression costs more than your annual cloud bill. Her writing focuses on the boring infrastructure that makes models actually serve traffic.

$26.99 (list $34.99)
Instant PDF + EPUB delivery
DRM-free, copy onto any device
Free chapter updates for the life of the edition
Specifications
Pages: 384
Edition: 1st Edition
Language: English
Level: Advanced
ISBN: 978-1-99999-009-8
Published: March 2026
Editorial review

Reviewed before publication by three working engineers at peer publications. We do not publish first drafts.

Table of contents

What you'll find inside.

  01 · The Latency Budget
  02 · GPU Memory Math, Honestly
  03 · Continuous Batching Demystified
  04 · KV Cache Eviction Strategies
  05 · Speculative Decoding
  06 · Quantization Without Regret
  07 · Multi-tenant Serving
  08 · The Cost Model That Convinces Finance
  09 · Observability for Token Streams
Reader reviews

5.0 / 5

274 verified readers

Verified purchase

Numbers, not vibes

Every other LLM serving guide I read was a vendor pitch in disguise. This one gives real load test data. The GPU math chapter is gold.

Tania Whitfield · ML Infrastructure Engineer