AI infra · OpEx · Free calculator

LLM Inference Cost Calculator

Estimate monthly OpenAI, Anthropic, Google, and open-source LLM bills from input/output tokens, requests per day, and per-million-token pricing. Models cache hits, batched discounts, and gross margin at a given retail price.

Disclaimer: Vendor pricing changes frequently. Plug in current per-million-token prices from your provider's pricing page for production budgets. Estimates exclude embedding costs, fine-tuning fees, and image/audio multimodal charges.

Scenarios
Common scenarios

Tap a persona to auto-load realistic numbers for that scenario, then tweak the sliders.

5,000
1,200

System prompt + user message + retrieved context. RAG apps often 2–8k.

400
$2.50

GPT-4o ≈ $2.50, Claude Sonnet ≈ $3, Gemini 2.5 Pro ≈ $1.25, Haiku/Mini ≈ $0.15–0.25.

$10.00

Output is usually 3–5× input. GPT-4o ≈ $10, Claude Sonnet ≈ $15, o1 ≈ $60.

30%

Cached input tokens cost ~10–25% of normal. Big system prompts benefit most.

$0.05

Set to 0 if you bundle into a subscription.

Formula used

LLM cost formula

Almost all frontier APIs price per million tokens. Input and output are billed separately; output is typically 3–5× the input rate. Prompt caching (Anthropic, OpenAI, Gemini) discounts cached input tokens by 75–90%.

Cost/req = (InTok/1M × InPrice × CacheMult) + (OutTok/1M × OutPrice)
Output / input price
~3–5×
Cache discount
75–90%
Mini-tier price
~10–20% of frontier
Backlink-friendly embed

Embed this calculator

Free to embed on any site. Inputs preserved, link back to RevenueLab. Each format trades polish for SEO juice.

<iframe src="https://revenuelab.fyi/embed/llm-inference-cost-calculator?requestsPerDay=5000&inputTokens=1200&outputTokens=400&inputPrice=2.5&outputPrice=10&cacheHitPct=30&pricePerRequest=0.05" width="100%" height="680" style="border:0;border-radius:12px;max-width:100%" loading="lazy" title="LLM Inference Cost Calculator"></iframe>
<p style="font:12px/1.4 system-ui;color:#666;margin:6px 0 0">Calculator by <a href="https://revenuelab.fyi/llm-inference-cost-calculator?requestsPerDay=5000&inputTokens=1200&outputTokens=400&inputPrice=2.5&outputPrice=10&cacheHitPct=30&pricePerRequest=0.05" target="_blank" rel="noopener">RevenueLab</a></p>

Easiest to install — passes referral traffic and a referring-domain signal.

Cite this calculator

Writing about this topic? Grab a citation — every link helps keep these tools free.

APA
RevenueLab. (2026). LLM Inference Cost Calculator. Retrieved from https://revenuelab.fyi/llm-inference-cost-calculator
HTML
<p>Source: <a href="https://revenuelab.fyi/llm-inference-cost-calculator" target="_blank" rel="noopener">LLM Inference Cost Calculator — RevenueLab</a> (2026).</p>
Markdown
Source: [LLM Inference Cost Calculator — RevenueLab](https://revenuelab.fyi/llm-inference-cost-calculator) (2026).

Token math, in plain English

One token ≈ 0.75 English words, ≈ 4 characters. A 1,000-word doc ≈ 1,300 tokens. A 'long' chat reply ≈ 500–800 tokens. Vendors price per 1,000,000 tokens — so $2.50/1M means $0.0025 per 1,000 tokens, or $0.000025 per 1k chars. Multiply by request volume and the rounding error becomes a real budget line fast.

  • Input is what you send (system + user + retrieved context + tool definitions).
  • Output is what the model generates (the assistant message + any tool args).
  • Tool/function-calling tokens count as output. Long JSON tool schemas count as input on every call.

Why your bill explodes faster than your traffic

LLM bills scale with tokens × requests, not just requests. Three things compound: (1) context bloat — devs add 'just one more example' to the system prompt and quietly 4× input tokens; (2) RAG retrieval — pulling top-20 chunks instead of top-3 multiplies input cost without measurably better answers; (3) reasoning models — o1/o3/Gemini Thinking burn thousands of hidden reasoning tokens billed as output. Track $/successful-task as your north star, not $/token.

Rex's Notes

Most teams discover their LLM bill three months after shipping, when finance forwards a $40k AWS-style invoice and asks what happened. The honest answer is almost always the same: nobody modeled cost per request before launch. This calculator does that math in 30 seconds — input tokens, output tokens, cache hit rate, and the retail price you charge — and tells you whether your unit economics survive contact with real traffic.

What each input means

Get these inputs right and the output is reliable. Get them wrong and the calculator just multiplies bad assumptions.

Requests per day

Total inference calls hitting the API, including retries and background jobs.

Typical range: 1k–10k for early SaaS; 100k–1M for consumer apps; 10M+ for embedded classifiers.

Avg input tokens / request

System prompt + user message + retrieved RAG chunks + tool/function definitions. Use a real distribution, not a guess.

Typical range: 500–2,000 for chat; 4k–12k for RAG; 200–600 for classification.

Avg output tokens / request

Assistant reply + any tool-call arguments. Reasoning models burn 2–10x more invisible tokens.

Typical range: 100–500 for chat; 500–2,000 for code/long-form; 4k+ for o1/o3/Thinking models.

Input / output price ($/1M)

Vendor pricing per million tokens. Output is almost always 3–5x input.

Typical range: Frontier: $2.50/$10 (GPT-4o), $3/$15 (Sonnet). Mini-tier: $0.15/$0.60 (Haiku, Mini, Flash).

Prompt cache hit %

Share of input tokens served from prompt cache. Big shared system prompts benefit most; one-off requests don't.

Typical range: 10–30% for typical apps; 50–80% for high-volume RAG with stable system prompts.

Retail price per request

What the end customer effectively pays per call. Set to 0 if bundled into a flat subscription and check unit margin separately.

Typical range: $0.001–$0.05 for chat; $0.10–$2 for agent runs; $0 for free or freemium tiers.

Worked examples

Real scenarios with the math walked through line by line.

Example

B2B chatbot SaaS, GPT-4o

Scenario: 5,000 requests/day, 1,500 input tokens, 500 output tokens, $2.50/$10 pricing, 40% cache hit, $0.05/request retail.

Math: Cached input cost = 1,500/1M × $2.50 × (0.6 + 0.4×0.2) = $0.00255. Output = 500/1M × $10 = $0.005. Cost/req ≈ $0.0076. Daily = $38. Monthly = $1,140. Monthly revenue = $7,500. Gross margin = 85%.

Outcome: Healthy. The cache discount alone saves ~$340/mo. Watch for prompt bloat — adding 2k tokens to the system message would halve your margin.

Example

High-volume classifier on Haiku

Scenario: 1M requests/day, 400 input + 80 output tokens, $0.15/$0.60 pricing, 10% cache, $0.001/request retail.

Math: Input ≈ $0.000058. Output = $0.000048. Cost/req ≈ $0.000106. Daily cost = $106. Monthly = $3,180. Monthly revenue = $30,000. Margin = 89%.

Outcome: Classifier economics work because you picked the right model. Routing this to GPT-4o would 17x your cost and erase the business.

Common mistakes

Where this calculation usually goes wrong in the real world.

  • Modeling cost from a single 'typical' request instead of a token distribution. Real traffic has a long tail of huge requests that dominate the bill.
  • Ignoring output tokens. Output is 3–5x the price of input — and most apps run with no max_tokens cap.
  • Counting cache discounts before measuring them. Caches only help if the same prefix gets reused within the TTL; chatty apps with unique users often see <10% hit rates.
  • Forgetting reasoning tokens. o1, o3, and Gemini Thinking models bill hidden reasoning as output. A 'short' answer can secretly burn 8,000 output tokens.
  • Comparing vendors on input price alone. Always combine input + output × your real ratio before deciding.

When to use this calculator

  • Pricing a new AI feature before launch — set retail price > 4x worst-case cost/request.
  • Deciding whether to switch from frontier to mini-tier models (route 80% of traffic, keep 20% premium).
  • Building an investor pitch that needs defensible gross margin claims.
  • Negotiating an enterprise deal — model per-seat token usage at the customer's expected volume.
  • Justifying engineering time spent on prompt caching, RAG chunk reduction, or function-calling cleanup.

Glossary

Term

Token

The unit LLMs bill on. ~0.75 English words or ~4 characters per token. 1,000 tokens ≈ 750 words.

Term

Context window

Maximum input + output tokens a model accepts in a single request. Pricing is per token used, not per context window size.

Term

Prompt caching

Provider feature that stores shared input prefixes (system prompts, tool definitions) and bills cached tokens at 10–25% of normal.

Term

Reasoning tokens

Hidden chain-of-thought tokens that reasoning models (o1, o3, Gemini Thinking) generate before the visible answer. Billed as output.

Term

Gross margin

(Revenue − direct API cost) ÷ revenue. AI features below 60% gross margin usually can't survive scaling support and infra overhead.

More questions answered

Why is my actual bill 2–3x what this calculator predicts?

Three usual culprits: (1) retries from failed tool calls, rate-limit errors, and JSON-parse failures often double real request volume; (2) your token distribution has a long tail — average is misleading, p95 input is often 4x the mean; (3) you're billing reasoning tokens you didn't model. Pull a week of real usage from your provider dashboard and use medians plus p95, not means.

When does it make sense to self-host an open model?

Above roughly 20–50M tokens/day of sustained traffic on a single model, dedicated GPU hosting (Llama 3.1 70B, Qwen, Mistral) starts beating hosted APIs on raw cost. Below that, hosted wins once you include reliability, autoscaling, eval infrastructure, and engineering time. Use this calculator alongside the GPU cloud cost calculator to find the crossover for your specific workload.

How do I budget for an AI feature that's still in beta?

Triple your modeled cost-per-request to absorb prompt iteration, debug retries, and the fact that beta users behave nothing like production users. Track actual cost-per-successful-task weekly and adjust pricing or model routing before you exit beta — not after a viral launch.

Should batched API calls be priced differently?

Yes. OpenAI Batch API and Anthropic Message Batches give 50% off for async workloads with up to 24h latency. If you have nightly jobs, evals, or backfills, route them through batch endpoints — same model, half the cost.

Methodology last reviewed: 2026-05 by the RevenueLab editorial team.

FAQ

How are LLM API costs usually billed?

Per million tokens, split into input (prompt) and output (completion) prices. Output is typically 3–5× the input rate. Some vendors also bill separately for cached input, tool-use tokens, and image/audio inputs (per image or per second of audio).

How much do prompt caching discounts actually save?

Anthropic and OpenAI charge ~10–25% of normal price for cached input tokens, with a small write-cost on first use. For an app with a 4k-token system prompt and high request volume, cache hits commonly cut total input cost by 60–80%. Output tokens are never cached.

Should I use GPT-4 / Claude Sonnet for everything?

No. The cheapest reliable model that hits your quality bar usually wins. Route classification, extraction, and simple chat to Haiku, Mini, or Flash (10–20× cheaper). Reserve frontier models for reasoning-heavy work where output quality drives real revenue.

What about self-hosted open-source models?

Hosting Llama/Mistral/Qwen on your own GPUs only beats hosted APIs above roughly 10–50M tokens/day of sustained traffic, depending on model size. Below that, hosted APIs almost always win on total cost of ownership when you include reliability, scaling, and engineering time.