Pillar Guide · 14 min · 8 citations

Token Cost Optimization Playbook

Eight token cost techniques ranked by impact: caching, model routing, embedding-first retrieval, compression, batch, output capping, dedup, fine-tuning.

By AI Biz Hub · Published May 8, 2026 · Updated June 12, 2026

Education · General business information, not legal, tax, or financial advice. Editorial standards Sponsor disclosure Corrections

TL;DR

A token cost optimization playbook is a ranked list of techniques that reduce per-request inference cost without measurable quality regression. The eight techniques in this playbook stack: prompt caching, model routing cascade, embedding-first retrieval, context compression, batch API, output-length capping, semantic deduplication, and fine-tuning small models. Each one targets a different cost driver; combining them produces 70-95% cost reduction on most production AI workloads.

The single largest impact technique is prompt caching at 60-80% reduction on stable-prefix workloads. The second is model routing (90%+ reduction by sending easy queries to cheap models). The remaining techniques are smaller individually but compound. A solo AI founder spending $2,000/month on inference can typically reduce that to $300-$600/month with two weeks of focused optimization, with no measurable quality impact.

Token cost optimization is the highest-ROI engineering work in production AI. The techniques are well-documented, the dollar savings are measurable per request, and the implementation cost is days, not months. This playbook ranks the eight techniques by impact on a typical production workload, with the math, the worked examples, and the failure modes to avoid.

1. The eight techniques, ranked

Ranked by typical cost reduction on a production AI product with a stable workload pattern:

Rank   Technique                       Typical reduction    Implementation effort
1      Prompt caching                  60-80%               Hours
2      Model routing cascade           70-90%               Days
3      Embedding-first retrieval       40-70%               Days
4      Context compression             20-40%               Hours
5      Batch API                       50%                  Hours
6      Output-length capping           10-30%               Hours
7      Semantic deduplication          5-25%                Days
8      Fine-tuning small models        50-90%               Weeks

The percentage ranges describe the savings on the workload component the technique targets. Combining all eight on a single workload produces the multiplicative effect — typically 75-95% total cost reduction on production AI products that started without any optimization.

The right order to implement them depends on your workload. For chat-style products with stable system prompts, start with caching. For multi-route products with varying difficulty, start with routing. For RAG-style products with retrievals, start with embedding-first. Most workloads benefit from at least 4 of the 8 techniques.

2. Technique 1: Prompt caching

Prompt caching reuses input-token activations across requests with the same prefix^[1]. Anthropic charges 0.10x the base rate for cache reads and 1.25x for the initial write; OpenAI charges 0.50x for cache reads with no separate write premium^[4].

The math. For a workload with a 25,000-token stable system prompt reused 10 times per session, caching produces approximately 80% input-token cost reduction at typical hit rates of 70-90%. The break-even point is 2 reads at Anthropic's rates and 2 reads at OpenAI's rates.

Implementation. Identify the largest stable prefix in each route (system prompt + few-shot examples + retrieved documents typically). Place the cache breakpoint after the stable prefix, before any variable content (user input, dates, session IDs). Deploy and measure hit rate using cache_read_input_tokens in the API response. Tune the breakpoint until hit rate is above 70%.

Failure modes. Prefix below the minimum size (1,024 tokens for Sonnet, 2,048 for Haiku) silently disables caching. Variable content above the breakpoint produces 0% hit rate. Frequent model-version changes invalidate cache entries.

3. Technique 2: Model routing cascade

Model routing sends each request to the cheapest model that handles it well. The implementation: a router classifies the request difficulty (often using a small classifier model or rule-based logic), then routes to either a cheap, workhorse, or frontier model.

The math. A workload that uses Claude Sonnet 4.6 ($3/M input, $15/M output) for 100% of requests can typically be split 60/30/10 across Haiku 4.5 / Sonnet 4.6 / Opus 4.8 with no measurable quality loss on the easy 60%. Take a workload of 1B input + 200M output tokens^[5]:

Workload share          Input cost    Output cost    Subtotal
Sonnet 4.6 only (100%)   $3,000        $3,000         $6,000

Cascade:
- 60% Haiku 4.5          $600          $600           $1,200
- 30% Sonnet 4.6         $900          $900           $1,800
- 10% Opus 4.8           $500          $500           $1,000
Cascade total                                         $4,000  (33% reduction)

The skew matters. A near-even 60/30/10 split lands around 33% because the workhorse and flagship shares still dominate. The reduction grows fast as difficulty skews toward the easy end: a more typical 80/18/2 split (most queries are easy) reaches roughly 50%, and an even cheaper bottom tier (GPT-5.4-nano at $0.20/$1.25 for trivial classification) pushes it further. Most production workloads are heavily skewed — 70-85% of queries are easy classification, simple extraction, or short generation that runs fine on Haiku 4.5 or GPT-5.4-mini at 3-6x lower cost than the workhorse tier.

Implementation. Build a difficulty classifier (a small model or a rule-based heuristic). Tag each request with its tier. Route to the cheapest model that meets a quality threshold; escalate to higher tiers only on failure. Measure quality on a held-out eval set per tier; adjust the routing rules until quality holds.

4. Technique 3: Embedding-first retrieval

Embedding-first retrieval reduces the input token count by retrieving only the relevant context instead of stuffing everything into the prompt. For RAG and long-context applications, this is the highest-impact technique after caching.

The math. A documentation Q&A product that includes a full 50,000-token document in every prompt costs $150 per 1,000 queries on Sonnet. The same product using embedding-first retrieval to extract the top 5 most relevant 500-token chunks per query costs $7.50 per 1,000 queries, a 95% reduction on input cost. Embedding cost (OpenAI text-embedding-3-small at $0.02/M tokens^[6]) is negligible.

Implementation. Embed the document corpus once; store in pgvector or Pinecone. For each query, embed the query, retrieve the top K (typically 3-10) chunks, include only those chunks in the prompt. Quality typically holds or improves because the model sees less irrelevant content.

Failure modes. Retrieval that surfaces wrong or partial chunks produces worse outputs than including everything. The fix is to invest in retrieval quality — chunking strategy, embedding model choice, retrieval diversity (MMR re-ranking). Embedding-first retrieval works only if the retrieval quality is good; bad retrieval saves money but loses customers.

5. Technique 4: Context compression

Context compression reduces the size of in-prompt content by summarizing, removing redundancy, or applying domain-specific shrinking. Common patterns:

Conversation summarization. Long chat histories are summarized down to the last 2-3 turns plus a running summary. Saves 60-90% of tokens on long-running conversations.
Code snippet compression. For coding workflows, comments are stripped, imports are condensed, repeated patterns are referenced. Saves 20-40% in code-heavy contexts.
Document section selection. Long documents are pre-processed to extract only the relevant sections. Different from retrieval, this is rule-based selection within a single document. Saves 50-80% on structured documents.
JSON shrinking. Verbose JSON output is replaced with shorter formats (CSV, YAML, abbreviated keys) where the model output is consumed by downstream code. Saves 30-50% on tool-using workflows.

Implementation. Identify the largest variable-content blocks in your prompts. Build a compression step (rule-based or model-based) that runs before the main inference. Measure quality on the same eval set used for routing.

Pattern note. The compression step itself usually runs on a cheap model (Haiku 4.5 or GPT-5.4-mini). Compression cost is typically 5-15% of the savings produced — net win in nearly all cases.

6. Technique 5: Batch API

Batch APIs from Anthropic and OpenAI offer 50% off in exchange for accepting up to 24-hour latency^[2]^[3]. The discount applies to both input and output tokens.

The math. For workloads that do not need real-time response — overnight document processing, weekly report generation, periodic embeddings, training data generation, the batch API produces a flat 50% reduction. For interactive workloads, batch is unavailable.

Implementation. Identify which workloads are batch-eligible. Typical candidates: scheduled summaries, overnight content generation, weekly analytics processing, training data preparation. Submit batches via the provider's batch endpoint; receive results within the SLA window.

Failure modes. Batch jobs occasionally fail or are returned incomplete. Production batch processing requires retry logic and partial-result handling. The 24-hour SLA is a maximum, not a typical; most batches complete in 1-4 hours.

7. Technique 6: Output-length capping

Output tokens cost 4-5x what input tokens cost (Claude Sonnet at $3/M input vs $15/M output^[5]). Capping output length where possible is a direct cost saving with limited quality impact for many workloads.

Patterns:

Set max_tokens conservatively per route. A summarization route does not need 4,000 tokens of output; cap at 500. A classification route needs maybe 50 tokens of output; cap there.
Prompt for brevity. "Respond in 100 words or less" or "Respond in JSON only, no explanation" reliably reduces output by 30-50% on most models.
Use structured outputs. JSON-schema outputs are typically 30-50% shorter than free-form prose for the same information.
Strip post-generation. Some routes generate output that is trimmed to a smaller portion before display (e.g., extract one field from a JSON response). Move the extraction into the prompt to avoid generating the discarded fields.

Output capping is the most-skipped technique because the savings per request are small. The cumulative effect at production scale is meaningful — 10-30% on output costs across a workload.

8. Technique 7: Semantic deduplication

Semantic deduplication identifies near-duplicate requests and serves them from a cache rather than re-querying the model. The cache is keyed by semantic similarity (embedding distance) rather than exact string match.

The math. For workloads with significant query repetition (FAQ-style assistants, common-question tools, search products with popular queries), 5-25% of requests are near-duplicates of recent prior requests. Deduplication serves these from cache at near-zero cost.

Implementation. Embed each incoming query. Check the recent-cache for queries within a similarity threshold (typically cosine similarity > 0.92-0.96). If a match exists, return the cached response. Cache TTL is workload-dependent — minutes for fast-changing data, days for stable Q&A content.

Failure modes. Two semantically-similar queries may have different correct answers (e.g., "show me sales for Q1" vs "show me sales for Q2"). Tune the similarity threshold per workload; lower thresholds save more but produce more wrong-answer cache hits.

9. Technique 8: Fine-tuning small models

Fine-tuning a small model on your specific task replaces frontier-model inference with a smaller fine-tuned model that handles the same task at a far lower per-token cost^[8].

One caveat before the math: OpenAI is winding down its fine-tuning platform and has closed it to new users^[8], so as of 2026 the practical route for a solo founder is fine-tuning an open-weight small model on a hosted provider (Together, Fireworks, or self-host), not GPT fine-tunes.

The math. A classification task running on a workhorse model like Claude Sonnet 4.6 ($3/M input, $15/M output) can typically be replaced by a fine-tuned open-weight small model at roughly $0.20-$0.60/M input and $0.20-$0.80/M output, at quality comparable to the workhorse on that narrow task. That is on the order of a 10-20x cut on the classification volume — verify against your provider's current rate, because the multiple swings with both the base model you replace and the small model you land on.

Fine-tuning cost. Training cost is typically $50-$500 for a 5,000-10,000 example dataset. Hosted inference cost on the fine-tuned model is roughly 1.5-2x the base small-model rate. Break-even on the training cost happens at relatively low volumes — 100k-500k requests typically.

Implementation. Generate a high-quality training set from current frontier-model outputs (5,000-10,000 examples is typically sufficient). Fine-tune the small model. Evaluate against held-out test set. Route the specific task to the fine-tuned model; keep frontier model only for the long-tail of difficult cases.

Failure modes. Fine-tuned models drift from base capability — they may handle the trained task well but fail on adjacent variations. Fine-tuning is best for narrow, well-defined tasks (classification, extraction, format conversion) and worst for open-ended generation. Re-train periodically as task distribution shifts.

10. The combined effect: a worked example

A solo founder runs an AI documentation assistant on Claude Sonnet 4.6. Monthly stats:

200,000 requests per month
Average 30,000 input tokens per request (system prompt + few-shot + retrieved docs)
Average 500 output tokens per request
Baseline cost: 6B input * $3/M + 100M output * $15/M = $18,000 + $1,500 = $19,500/month

Applying the playbook:

Step                              Cost            Reduction
Baseline                          $19,500/mo      -
+ Prompt caching (75% hit rate)   $7,800/mo       60% reduction on input
+ Model routing (60% Haiku)       $4,200/mo       46% reduction (compounding)
+ Embedding-first retrieval       $2,800/mo       33% reduction (smaller context)
+ Context compression             $2,200/mo       21% reduction
+ Output capping (avg 250 tok)    $1,800/mo       18% reduction
+ Semantic dedup (15% hits)       $1,550/mo       14% reduction
+ Batch API on overnight 30%      $1,300/mo       16% reduction
+ Fine-tuned classifier on easy   $850/mo         35% reduction
TOTAL                              $850/mo         96% total reduction

The cumulative effect of all eight techniques on this workload is a 96% cost reduction — from $19,500/month to roughly $850/month. The implementation cost is approximately 4-6 weeks of one engineer's time spread across the techniques. Payback is roughly 2-4 weeks of operating cost.

Not every workload achieves 96%. Workloads without stable prefixes do not benefit from caching. Workloads without difficulty variance do not benefit from routing. Workloads with strict latency requirements cannot use batch API. The realistic floor for production AI workloads with no optimization is 50-70% cost reduction; the realistic ceiling for fully-optimized workloads is 90-97%.

Two implementation principles. First, measure before each step. The percentage reductions above are typical, not guaranteed; some workloads benefit more or less. Track input cost, output cost, cache hit rate, and per-tier model usage weekly. Second, never sacrifice quality for cost. Run the same eval suite against every change; if quality regresses, the change is not worth the savings. Cost optimization that loses customers is more expensive than the cost it saves. Token cost optimization is engineering, not magic, the techniques are well-documented and the dollar math is verifiable. Solo founders who treat inference cost as a managed engineering surface, not as a fixed bill from the vendor, run AI products at margins that compete with traditional SaaS instead of being squeezed by foundation-model economics.

Frequently asked questions

How can I reduce my LLM token costs?

Stack eight techniques, each targeting a different cost driver: prompt caching, a model-routing cascade, embedding-first retrieval, context compression, the batch API, output-length capping, semantic deduplication, and fine-tuning small models. Combined, they produce a 70 to 95 percent cost reduction on most production AI workloads with no measurable quality regression. A solo founder spending $2,000 a month on inference can typically get to $300 to $600 a month with about two weeks of focused optimization.

Which token-optimization technique has the biggest impact?

Prompt caching, at a 60 to 80 percent reduction on stable-prefix workloads, is the single largest lever. Model routing is second — sending easy queries to a cheap model and reserving the expensive model for hard ones delivers 90-plus percent reduction on the routed share. The remaining six techniques (embedding-first retrieval, context compression, batch API, output capping, semantic dedup, fine-tuning) are smaller individually but compound on top of the first two.

Does cutting token costs hurt output quality?

Done correctly, no — the goal of the playbook is reducing per-request inference cost without measurable quality regression. Each technique targets cost, not capability: caching reuses an identical prefix, routing only sends easy queries to cheaper models, output capping trims verbosity that adds no value, and fine-tuning a small model on your task can match a larger model on that narrow task. The discipline is to validate each change against a held-out eval set so any quality impact is measured rather than assumed before it ships.

How does the batch API cut LLM costs?

Both Anthropic and OpenAI offer a 50 percent discount on batch processing in exchange for a 24-hour turnaround SLA. For any non-time-sensitive bulk job — classification, document processing, backfills, offline enrichment — the batch rate is the number to budget on rather than the standard synchronous rate, halving the cost of that share of the workload. It only applies where latency does not matter, so it stacks with caching and routing on the asynchronous portion of a product's inference.

References

Sources

Primary sources only. No vendor-marketing blogs or aggregated secondary claims.

1 Anthropic — Prompt caching documentation (mechanics, pricing, TTL) — accessed 2026-05-08
2 Anthropic — Message Batches API documentation (50% off, 24h SLA) — accessed 2026-05-08
3 OpenAI — Batch API documentation (50% discount, 24h turnaround) — accessed 2026-05-08
4 OpenAI — Prompt caching announcement (50% discount on cached input) — accessed 2026-05-08
5 Anthropic — API pricing (per-million-token rates including cache) — accessed 2026-06-12
6 OpenAI — API pricing (per-million-token rates and embedding costs) — accessed 2026-06-12
7 Google AI — Gemini context caching pricing and TTL behavior — accessed 2026-05-08
8 OpenAI — Model optimization / fine-tuning guide (platform winding down, closed to new users) — accessed 2026-06-12

Tools referenced in this article

Plan Your Build

AI Stack Cost Calculator

Estimate your full AI app stack cost at different user scales — hosting, DB, auth, AI API, and services.

Run the Numbers

AI Product Margin Calculator

Calculate per-user margin for AI products from subscription price, API token costs, hosting, and per-user expenses.

Plan Your Build

Embeddings DB Cost

Pinecone, Postgres+pgvector, LanceDB, or Turbopuffer — cheapest for your workload.

Run the Numbers

AI vs Human Support Cost

Compare AI-first and human-only support cost with token spend and escalation overhead.

12 min

Prompt Caching ROI: When It Pays Off, When It Does

Anthropic prompt caching mechanics, hit-rate sensitivity, and the break-even derivation. Worked example: 61-84% input cost reduction on a docs assistant.