喵学堂 · PurrLearn — 零基础学任何东西

❤️ 15/15

Foundation: inference has two phases, and output tokens dominate latency and cost

The first step in cost / latency engineering is understanding that LLM generation happens in two phases (NVIDIA's inference optimization guide):

- Prefill — processes the entire prompt in parallel in one shot and computes the first token; it's matrix-matrix math that can saturate the GPU, so it's compute-bound, and it determines time to first token (TTFT).
- Decode — emits tokens one at a time autoregressively; each step is matrix-vector math with low compute utilization, and the bottleneck is the bandwidth of moving weights and KV from memory into the GPU, so it's memory-bandwidth-bound and determines the inter-token latency (ITL).

This divide is the foundation for every optimization that follows: shortening the prompt mainly saves prefill / TTFT, while shortening the output mainly saves decode / total time.

🔆Think of inference as cooking. Prefill is throwing all the ingredients into the wok at once and stir-frying at full heat (compute) — the first bite (first token) comes out fast. Decode is then serving it out one spoonful at a time; the bottleneck isn't the stove but how fast you run back and forth to the pantry (memory) for more. So the longer the dish (output), the more pantry trips, and the longer it takes overall.

From this comes a counterintuitive but very practical conclusion (Anthropic's latency docs + NVIDIA): output tokens are usually priced higher than input tokens, and because decode is a token-by-token, memory-bound serial process, generation length tends to dominate end-to-end total time — generating fewer tokens affects latency more than feeding fewer tokens.

Anthropic's three official latency-reduction moves map directly onto this:

1. Pick the right model — use the faster Claude Haiku for speed-critical cases;
2. Optimize the prompt and output length — keep the model concise and use max_tokens as a hard cap;
3. Use streaming — get the first token out as early as possible to improve perceived responsiveness.

One subtle pitfall: asking for limits by sentence / paragraph is more effective than by exact word count, because the model counts by token, not by word.

📷 Image placeholder: a horizontal timeline — the left block labeled "Prefill (compute-bound, sets TTFT)" processing the whole prompt in parallel, the right block labeled "Decode (memory-bound, sets ITL/total time)" emitting small boxes one by one serially, with an arrow noting "the longer the output, the longer this segment"