Token math, in plain English
One token ≈ 0.75 English words, ≈ 4 characters. A 1,000-word doc ≈ 1,300 tokens. A 'long' chat reply ≈ 500–800 tokens. Vendors price per 1,000,000 tokens, so $2.50/1M means $0.0025 per 1,000 tokens, or about $0.000625 per 1k characters (≈ 250 tokens). Multiply by request volume and the rounding error becomes a real budget line fast.
- Input is what you send (system + user + retrieved context + tool definitions).
- Output is what the model generates (the assistant message + any tool args).
- Tool-call arguments the model generates count as output. Long JSON tool schemas count as input on every call.
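The rules of thumb above can be turned into a quick estimator. This is a minimal sketch, assuming the ~0.75 words/token and ~4 characters/token ratios hold for your text; the function names are illustrative, and a real tokenizer will give different counts for code, non-English text, or unusual formatting.

```python
# Rules of thumb quoted above: ~0.75 English words or ~4 characters per token.
WORDS_PER_TOKEN = 0.75
CHARS_PER_TOKEN = 4


def tokens_from_words(words: float) -> float:
    """Estimate token count from an English word count."""
    return words / WORDS_PER_TOKEN


def tokens_from_chars(chars: float) -> float:
    """Estimate token count from a character count."""
    return chars / CHARS_PER_TOKEN


def cost_usd(tokens: float, price_per_million: float) -> float:
    """Dollar cost of `tokens` at a vendor price quoted per 1M tokens."""
    return tokens * price_per_million / 1_000_000


print(f"1,000 words ~ {tokens_from_words(1_000):.0f} tokens")
print(f"1,000 tokens at $2.50/1M = ${cost_usd(1_000, 2.50):.4f}")
print(f"1,000 chars at $2.50/1M = ${cost_usd(tokens_from_chars(1_000), 2.50):.6f}")
```

For budgeting, estimates like these are only a starting point; measured token counts from the provider's usage reports should replace them as soon as they exist.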
Why your bill explodes faster than your traffic
LLM bills scale with tokens × requests, not just requests. Three things compound: (1) context bloat: devs add 'just one more example' to the system prompt and quietly quadruple input tokens; (2) RAG retrieval: pulling top-20 chunks instead of top-3 multiplies retrieved-context tokens nearly 7×, often without measurably better answers; (3) reasoning models: o1/o3/Gemini Thinking burn thousands of hidden reasoning tokens billed as output. Track $/successful-task as your north star, not $/token.
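To make the $/successful-task idea concrete, here is a small sketch. The spend figures and success rates are hypothetical placeholders, not measurements; the point is only that a setup that is cheaper per token can still cost more per task that actually succeeds.

```python
def cost_per_successful_task(total_cost_usd: float, tasks: int, success_rate: float) -> float:
    """Total spend divided by the number of tasks that actually succeeded."""
    successes = tasks * success_rate
    if successes <= 0:
        raise ValueError("need at least one successful task")
    return total_cost_usd / successes


# Hypothetical comparison over 10,000 attempted tasks.
lean = cost_per_successful_task(100.0, 10_000, 0.60)     # cheaper per token, fails more often
premium = cost_per_successful_task(150.0, 10_000, 0.95)  # pricier per token, fails less often
print(f"lean: ${lean:.4f} per success, premium: ${premium:.4f} per success")
```

What counts as a "successful task" is a product decision (a resolved ticket, an accepted suggestion, a passing eval), so the success rate has to come from your own instrumentation.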

Most teams discover their LLM bill three months after shipping, when finance forwards a $40k AWS-style invoice and asks what happened. The honest answer is almost always the same: nobody modeled cost per request before launch. This calculator does that math in 30 seconds — input tokens, output tokens, cache hit rate, and the retail price you charge — and tells you whether your unit economics survive contact with real traffic.
What each input means
Get these inputs right and the output is reliable. Get them wrong and the calculator just multiplies bad assumptions.
Requests per day
Total inference calls hitting the API, including retries and background jobs.
Typical range: 1k–10k for early SaaS; 100k–1M for consumer apps; 10M+ for embedded classifiers.
Avg input tokens / request
System prompt + user message + retrieved RAG chunks + tool/function definitions. Use a real distribution, not a guess.
Typical range: 500–2,000 for chat; 4k–12k for RAG; 200–600 for classification.
Avg output tokens / request
Assistant reply + any tool-call arguments. Reasoning models can add 2–10x that amount in hidden reasoning tokens, which are also billed as output.
Typical range: 100–500 for chat; 500–2,000 for code/long-form; 4k+ for o1/o3/Thinking models.
Input / output price ($/1M)
Vendor pricing per million tokens. Output is almost always 3–5x input.
Typical range: Frontier: $2.50/$10 (GPT-4o), $3/$15 (Sonnet). Mini-tier: around $0.15/$0.60 (GPT-4o mini); exact Haiku and Flash rates differ, so check each vendor's current price sheet.
Prompt cache hit %
Share of input tokens served from prompt cache. Big shared system prompts benefit most; one-off requests don't.
Typical range: 10–30% for typical apps; 50–80% for high-volume RAG with stable system prompts.
Retail price per request
What the end customer effectively pays per call. Set to 0 if bundled into a flat subscription and check unit margin separately.
Typical range: $0.001–$0.05 for chat; $0.10–$2 for agent runs; $0 for free or freemium tiers.
Worked examples
Real scenarios with the math walked through line by line.
B2B chatbot SaaS, GPT-4o
Scenario: 5,000 requests/day, 1,500 input tokens, 500 output tokens, $2.50/$10 pricing, 40% cache hit, $0.05/request retail.
Math: Input cost with caching = 1,500/1M × $2.50 × (0.6 + 0.4×0.2) = $0.00255, where the 0.2 assumes cached tokens bill at 20% of the normal rate. Output = 500/1M × $10 = $0.005. Cost/req ≈ $0.0076. Daily ≈ $38. Monthly (30 days) ≈ $1,140. Monthly revenue = $7,500. Gross margin ≈ 85%.
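Outcome: Healthy. The cache discount alone saves ~$180/mo (input drops from $0.00375 to $0.00255 per request). Watch for prompt bloat: adding 2k tokens to the system message would raise cost per request by about 45% at the same cache hit rate and pull gross margin down to roughly 78%.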
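The arithmetic in this example can be reproduced with a short script. This is a sketch, not the calculator's actual implementation: the `Scenario` class and its field names are made up for illustration, cached tokens are assumed to bill at 20% of the normal input rate (the 0.2 factor in the math above), and a month is assumed to be 30 days.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    requests_per_day: int
    input_tokens: int
    output_tokens: int
    input_price: float               # $ per 1M input tokens
    output_price: float              # $ per 1M output tokens
    cache_hit: float                 # share of input tokens served from cache, 0..1
    retail_per_request: float        # $ charged to the customer per request
    cached_price_ratio: float = 0.2  # assumption: cached tokens cost 20% of normal
    days_per_month: int = 30         # assumption


def cost_per_request(s: Scenario) -> float:
    """API cost of one request, with the cache discount applied to input only."""
    input_multiplier = (1 - s.cache_hit) + s.cache_hit * s.cached_price_ratio
    input_cost = s.input_tokens / 1_000_000 * s.input_price * input_multiplier
    output_cost = s.output_tokens / 1_000_000 * s.output_price
    return input_cost + output_cost


def monthly_summary(s: Scenario) -> dict:
    """Monthly cost, revenue, and gross margin (None when revenue is zero)."""
    volume = s.requests_per_day * s.days_per_month
    cost = cost_per_request(s) * volume
    revenue = s.retail_per_request * volume
    margin = (revenue - cost) / revenue if revenue else None
    return {"cost": cost, "revenue": revenue, "margin": margin}


chatbot = Scenario(5_000, 1_500, 500, 2.50, 10.00, 0.40, 0.05)
summary = monthly_summary(chatbot)
print(f"cost/request: ${cost_per_request(chatbot):.5f}")
print(f"monthly cost: ${summary['cost']:,.2f}, revenue: ${summary['revenue']:,.2f}")
print(f"gross margin: {summary['margin']:.1%}")
```

The script works from unrounded values, so its monthly cost comes out slightly below a figure built from a daily cost that was rounded up first.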
High-volume classifier on Haiku
Scenario: 1M requests/day, 400 input + 80 output tokens, $0.15/$0.60 pricing, 10% cache, $0.001/request retail.
Math: Input = 400/1M × $0.15 × (0.9 + 0.1×0.2) ≈ $0.000055. Output = 80/1M × $0.60 = $0.000048. Cost/req ≈ $0.000103. Daily cost ≈ $103. Monthly ≈ $3,100. Monthly revenue = $30,000. Margin ≈ 90%.
Outcome: Classifier economics work because you picked the right model. Routing this to GPT-4o would 17x your cost and erase the business.
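The model-choice claim can be checked by pricing the same request at both tiers. A minimal sketch, assuming the $0.15/$0.60 and $2.50/$10 price points used in this guide and a 20% billing rate for cached tokens; the function name is illustrative.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float,
                     cache_hit: float = 0.0, cached_price_ratio: float = 0.2) -> float:
    """Cost of one request; prices are $ per 1M tokens, cache applies to input only."""
    input_multiplier = (1 - cache_hit) + cache_hit * cached_price_ratio
    input_cost = input_tokens / 1_000_000 * input_price * input_multiplier
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


RETAIL = 0.001  # $ per request, from the scenario above

mini = cost_per_request(400, 80, 0.15, 0.60, cache_hit=0.10)
frontier = cost_per_request(400, 80, 2.50, 10.00, cache_hit=0.10)

print(f"mini-tier: ${mini:.6f}/request")
print(f"frontier:  ${frontier:.6f}/request ({frontier / mini:.1f}x)")
print(f"frontier cost exceeds retail price: {frontier > RETAIL}")
```

Because both input and output prices differ by the same factor here, the cost ratio does not depend on the token mix or the cache hit rate.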
Common mistakes
Where this calculation usually goes wrong in the real world.
- Modeling cost from a single 'typical' request instead of a token distribution. Real traffic has a long tail of huge requests that dominate the bill.
- Ignoring output tokens. Output is 3–5x the price of input — and most apps run with no max_tokens cap.
- Counting cache discounts before measuring them. Caches only help if the same prefix gets reused within the TTL; chatty apps with unique users often see <10% hit rates.
- Forgetting reasoning tokens. o1, o3, and Gemini Thinking models bill hidden reasoning as output. A 'short' answer can secretly burn 8,000 output tokens.
- Comparing vendors on input price alone. Always combine input + output × your real ratio before deciding.
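The first mistake in the list is easy to see with a toy distribution. The sample below is hypothetical (90 small requests and 10 very large ones); with real data you would load per-request input-token counts from your provider's usage export instead.

```python
import math
from statistics import mean, median


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tail_share(values: list[int], top_fraction: float = 0.10) -> float:
    """Share of all tokens consumed by the largest `top_fraction` of requests."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical input-token counts for 100 requests.
sample = [500] * 90 + [10_000] * 10

print(f"median: {median(sample):.0f} tokens")
print(f"mean:   {mean(sample):.0f} tokens")
print(f"p95:    {nearest_rank_percentile(sample, 95)} tokens")
print(f"top 10% of requests use {tail_share(sample):.0%} of all input tokens")
```

In this toy sample the "typical" request is 500 tokens, yet a tenth of the requests account for most of the input spend, which is why a single representative request underestimates the bill.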
When to use this calculator
- Pricing a new AI feature before launch — set retail price > 4x worst-case cost/request.
- Deciding whether to switch from frontier to mini-tier models (route 80% of traffic, keep 20% premium).
- Building an investor pitch that needs defensible gross margin claims.
- Negotiating an enterprise deal — model per-seat token usage at the customer's expected volume.
- Justifying engineering time spent on prompt caching, RAG chunk reduction, or function-calling cleanup.
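Two of the rules of thumb in this list, the 4x retail floor and the 80/20 routing split, reduce to a few lines of arithmetic. The sketch below assumes a 1,500-input / 500-output chat request with no caching and the price points quoted earlier in this guide; the function names are illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Uncached cost of one request; prices are $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def blended_cost(cheap_cost: float, premium_cost: float, cheap_share: float = 0.8) -> float:
    """Average cost per request when `cheap_share` of traffic goes to the cheap model."""
    return cheap_share * cheap_cost + (1 - cheap_share) * premium_cost


def min_retail_price(worst_case_cost: float, multiple: float = 4.0) -> float:
    """Retail floor implied by the 'price > 4x worst-case cost' rule of thumb."""
    return worst_case_cost * multiple


frontier = request_cost(1_500, 500, 2.50, 10.00)
mini = request_cost(1_500, 500, 0.15, 0.60)
blended = blended_cost(mini, frontier, cheap_share=0.8)

print(f"frontier only: ${frontier:.5f}/request")
print(f"80/20 routing: ${blended:.5f}/request ({1 - blended / frontier:.0%} cheaper)")
print(f"retail floor at 4x frontier cost: ${min_retail_price(frontier):.3f}/request")
```

Here the frontier-only cost stands in for the worst case; if your long-tail requests are much larger than average, price the floor off those instead.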
Glossary
Token
The unit LLMs bill on. ~0.75 English words or ~4 characters per token. 1,000 tokens ≈ 750 words.
Context window
Maximum input + output tokens a model accepts in a single request. Pricing is per token used, not per context window size.
Prompt caching
Provider feature that stores shared input prefixes (system prompts, tool definitions) and bills cached tokens at 10–25% of normal.
Reasoning tokens
Hidden chain-of-thought tokens that reasoning models (o1, o3, Gemini Thinking) generate before the visible answer. Billed as output.
Gross margin
(Revenue − direct API cost) ÷ revenue. AI features below 60% gross margin usually can't survive scaling support and infra overhead.
More questions answered
Why is my actual bill 2–3x what this calculator predicts?
Three usual culprits: (1) retries from failed tool calls, rate-limit errors, and JSON-parse failures often double real request volume; (2) your token distribution has a long tail — average is misleading, p95 input is often 4x the mean; (3) you're billing reasoning tokens you didn't model. Pull a week of real usage from your provider dashboard and use medians plus p95, not means.
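A rough way to reconcile the modeled bill with the actual one is to add the two missing factors explicitly. This is a sketch under stated assumptions: `retry_multiplier` and `reasoning_tokens` are hypothetical parameters you would estimate from logs, and the traffic numbers are placeholders.

```python
def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               input_price: float, output_price: float,
               retry_multiplier: float = 1.0, reasoning_tokens: int = 0) -> float:
    """Daily API cost; retries scale request volume, reasoning tokens bill as output."""
    billed_output = output_tokens + reasoning_tokens
    per_request = (input_tokens * input_price + billed_output * output_price) / 1_000_000
    return requests * retry_multiplier * per_request


modeled = daily_cost(5_000, 1_500, 500, 2.50, 10.00)
with_retries = daily_cost(5_000, 1_500, 500, 2.50, 10.00, retry_multiplier=2.0)
with_reasoning = daily_cost(5_000, 1_500, 500, 2.50, 10.00, reasoning_tokens=1_000)

print(f"modeled:                 ${modeled:.2f}/day")
print(f"retries double volume:   ${with_retries:.2f}/day")
print(f"+1,000 reasoning tokens: ${with_reasoning:.2f}/day")
```

Either factor alone is enough to roughly double the bill in this example, and they multiply when both apply.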
When does it make sense to self-host an open model?
Above roughly 10–50M tokens/day of sustained traffic on a single model (the threshold depends on model size), dedicated GPU hosting (Llama 3.1 70B, Qwen, Mistral) starts beating hosted APIs on raw cost. Below that, hosted wins once you include reliability, autoscaling, eval infrastructure, and engineering time. Use this calculator alongside the GPU cloud cost calculator to find the crossover for your specific workload.
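A first-pass crossover estimate is just fixed hosting cost divided by the API price per token. The sketch below is deliberately crude: the $6,000/month hosting figure and the $5 per 1M blended API price are hypothetical placeholders, and the formula ignores GPU throughput limits, utilization, and engineering time, all of which push the real crossover higher.

```python
def breakeven_tokens_per_day(fixed_monthly_cost_usd: float,
                             api_price_per_million: float,
                             days_per_month: int = 30) -> float:
    """Daily token volume at which a fixed hosting bill equals the hosted-API bill."""
    daily_fixed = fixed_monthly_cost_usd / days_per_month
    return daily_fixed / api_price_per_million * 1_000_000


# Hypothetical inputs: $6,000/month of dedicated GPUs vs a $5 per 1M blended API price.
tokens = breakeven_tokens_per_day(6_000, 5.00)
print(f"break-even at about {tokens / 1_000_000:.0f}M tokens/day")
```

Treat the result as a lower bound on the volume you need before self-hosting is worth a serious evaluation.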
How do I budget for an AI feature that's still in beta?
Triple your modeled cost-per-request to absorb prompt iteration, debug retries, and the fact that beta users behave nothing like production users. Track actual cost-per-successful-task weekly and adjust pricing or model routing before you exit beta — not after a viral launch.
Should batched API calls be priced differently?
Yes. OpenAI Batch API and Anthropic Message Batches give 50% off for async workloads with up to 24h latency. If you have nightly jobs, evals, or backfills, route them through batch endpoints — same model, half the cost.
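The effect on a monthly bill depends on how much of the workload can tolerate async latency. A minimal sketch, assuming the 50% batch discount described above; the $3,000 bill and the 40% batchable share are hypothetical.

```python
def monthly_cost_with_batch(monthly_cost_usd: float, batch_share: float,
                            batch_discount: float = 0.5) -> float:
    """Monthly cost after moving `batch_share` of spend to a discounted batch endpoint."""
    if not 0 <= batch_share <= 1:
        raise ValueError("batch_share must be between 0 and 1")
    return monthly_cost_usd * (1 - batch_share * batch_discount)


# Hypothetical: $3,000/month bill where 40% of spend is nightly jobs and evals.
print(f"${monthly_cost_with_batch(3_000, 0.40):,.2f}")
```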
Methodology last reviewed: 2026-05 by the RevenueLab editorial team.
FAQ
How are LLM API costs usually billed?
Per million tokens, split into input (prompt) and output (completion) prices. Output is typically 3–5× the input rate. Some vendors also bill separately for cached input, tool-use tokens, and image/audio inputs (per image or per second of audio).
How much do prompt caching discounts actually save?
Anthropic and OpenAI charge ~10–25% of normal price for cached input tokens; some providers also add a small write cost the first time a prefix is cached. For an app with a 4k-token system prompt and high request volume, cache hits commonly cut total input cost by 60–80%. Output tokens are never cached.
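The relationship between hit rate and savings is linear, which makes it easy to sanity-check a claimed discount. This sketch ignores any cache-write surcharge and assumes the hit rate is measured as a share of input tokens; the function names are illustrative.

```python
def input_savings_fraction(hit_rate: float, cached_price_ratio: float) -> float:
    """Fraction of input cost saved when `hit_rate` of input tokens bill at the cached rate."""
    return hit_rate * (1 - cached_price_ratio)


def hit_rate_needed(target_savings: float, cached_price_ratio: float) -> float:
    """Token-level hit rate required to reach a target input-cost saving."""
    return target_savings / (1 - cached_price_ratio)


print(f"80% hits at a 10% cached rate saves {input_savings_fraction(0.80, 0.10):.0%} of input cost")
print(f"a 60% saving at a 25% cached rate needs {hit_rate_needed(0.60, 0.25):.0%} hits")
```

Savings in the 60–80% range therefore require most input tokens to come from a cached prefix, which is realistic mainly when a large, stable system prompt dominates each request.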
Should I use GPT-4 / Claude Sonnet for everything?
No. The cheapest reliable model that hits your quality bar usually wins. Route classification, extraction, and simple chat to Haiku, Mini, or Flash (10–20× cheaper). Reserve frontier models for reasoning-heavy work where output quality drives real revenue.
What about self-hosted open-source models?
Hosting Llama/Mistral/Qwen on your own GPUs only beats hosted APIs above roughly 10–50M tokens/day of sustained traffic, depending on model size. Below that, hosted APIs almost always win on total cost of ownership when you include reliability, scaling, and engineering time.