Serving LLMs in Production
GPUs, batching, and the latency budget you forgot about
Notebook demos do not survive contact with production traffic. This book covers GPU memory math, continuous batching, paged attention, KV cache eviction, speculative decoding, and the cost models that determine whether a feature ships green or red. Includes load test results across vLLM, TGI, and TensorRT-LLM under three traffic shapes.
Priya runs the kind of ML platforms where a 200ms regression costs more than your annual cloud bill. Her writing focuses on the boring infrastructure that makes models actually serve traffic.
- Pages: 384
- Edition: 1st
- Language: English
- Level: Advanced
- ISBN: 978-1-99999-009-8
- Published: March 2026
Reviewed before publication by three working engineers at peer publications. We do not publish first drafts.
What you'll find inside.
- 01 The Latency Budget
- 02 GPU Memory Math, Honestly
- 03 Continuous Batching Demystified
- 04 KV Cache Eviction Strategies
- 05 Speculative Decoding
- 06 Quantization Without Regret
- 07 Multi-tenant Serving
- 08 The Cost Model That Convinces Finance
- 09 Observability for Token Streams
5.0 / 5
274 verified readers
Numbers, not vibes
Every other LLM serving guide I read was a vendor pitch in disguise. This one gives real load test data. The GPU math chapter is gold.