The first step in cost / latency engineering is understanding that LLM generation happens in two phases (NVIDIA's inference optimization guide):
- Prefill — processes the entire prompt in parallel in one shot and computes the first token; it's matrix-matrix math that can saturate the GPU, so it's compute-bound, and it determines time to first token (TTFT).
- Decode — emits tokens one at a time autoregressively; each step is matrix-vector math with low compute utilization, and the bottleneck is the bandwidth of moving weights and KV from memory into the GPU, so it's memory-bandwidth-bound and determines the inter-token latency (ITL).
This divide is the foundation for every optimization that follows: shortening the prompt mainly saves prefill / TTFT, while shortening the output mainly saves decode / total time.
From this comes a counterintuitive but very practical conclusion (Anthropic's latency docs + NVIDIA): output tokens are usually priced higher than input tokens, and because decode is a token-by-token, memory-bound serial process, generation length tends to dominate end-to-end total time — generating fewer tokens affects latency more than feeding fewer tokens.
Anthropic's three official latency-reduction moves map directly onto this:
1. Pick the right model — use the faster Claude Haiku for speed-critical cases;
2. Optimize the prompt and output length — keep the model concise and use max_tokens as a hard cap;
3. Use streaming — get the first token out as early as possible to improve perceived responsiveness.
One subtle pitfall: asking for limits by sentence / paragraph is more effective than by exact word count, because the model counts by token, not by word.
📷 Image placeholder: a horizontal timeline — the left block labeled "Prefill (compute-bound, sets TTFT)" processing the whole prompt in parallel, the right block labeled "Decode (memory-bound, sets ITL/total time)" emitting small boxes one by one serially, with an arrow noting "the longer the output, the longer this segment"