Where AI costs actually leak
Most teams underestimate output tokens by 3–5×. A 'short answer' from an unconstrained model averages 400–800 tokens. The fix is hard system-prompt limits ('respond in <= 80 words') and an aggressive max_tokens cap, not just choosing a cheaper model.
- Prompt caching cuts repeated system-prompt cost by 50–90%.
- Batch API endpoints are 50% cheaper for non-realtime work.
- Cheaper models often need about 1.5× as many attempts per task, so net cost can be higher.
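The retry point in the last bullet can be checked with a rough calculation. The sketch below assumes the 1.5× figure means about 1.5× as many attempts per completed task. The per-attempt prices and the function name are hypothetical and are not taken from any provider's price list.

```python
def cost_per_task(cost_per_attempt: float, attempts_per_task: float) -> float:
    """Expected spend to get one acceptable result, counting retries."""
    return cost_per_attempt * attempts_per_task


# Hypothetical numbers for illustration only.
big_model = cost_per_task(cost_per_attempt=0.020, attempts_per_task=1.0)
cheap_model = cost_per_task(cost_per_attempt=0.015, attempts_per_task=1.5)

print(f"big model:   ${big_model:.4f} per task")
print(f"cheap model: ${cheap_model:.4f} per task")

# The cheaper model only wins if its per-attempt price is below
# big_model / attempts_per_task (here 0.020 / 1.5, about $0.0133).
break_even = big_model / 1.5
print(f"break-even per-attempt price: ${break_even:.4f}")
```

With these inputs the model that is 25% cheaper per attempt ends up more expensive per completed task, which is why cost-per-task is the number to compare.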
Pricing your AI feature
If a user runs ~100 requests/mo at $0.02 cost/request, that is $2/mo in cost, so you need to charge $7+/mo ($2 / 0.30 ≈ $6.67) for a 70% gross margin. Free tiers should hard-cap requests, not tokens: a request cap is easier to communicate and harder to abuse.
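The margin arithmetic can be sketched as a small helper. It assumes gross margin is defined as (price - cost) / price and that inference is the only cost of goods sold; the function name is illustrative.

```python
def min_price_for_margin(requests_per_month: int,
                         cost_per_request: float,
                         target_gross_margin: float) -> float:
    """Lowest monthly price that still hits the target gross margin.

    Gross margin = (price - cost) / price, so price = cost / (1 - margin).
    """
    if not 0 <= target_gross_margin < 1:
        raise ValueError("target_gross_margin must be in [0, 1)")
    monthly_cost = requests_per_month * cost_per_request
    return monthly_cost / (1 - target_gross_margin)


price = min_price_for_margin(100, 0.02, 0.70)
print(f"minimum price: ${price:.2f}/mo")  # about $6.67, hence "$7+"
```

Any other per-user costs (hosting, support, payment fees) would raise the required price, so treat the result as a floor.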

Comparing GPT-5, Claude, and Gemini API pricing on input tokens alone misses 60% of the real cost: output tokens are 3–10× more expensive, prompt caching changes economics entirely, and tool-calling rounds blow up your token count. This calculator models true blended cost across providers so you can pick the right model per workload, not per blog post.
What each input means
Get these inputs right and the output is reliable. Get them wrong and the calculator just multiplies bad assumptions.
Monthly requests
Total inference calls made to the model per month.
Typical range: Highly workload-specific — measure with logging, don't guess.
Average input tokens
System prompt + user message + RAG context.
Typical range: 500–2,000 chat; 5,000–20,000 RAG; 20,000+ for code-context agents.
Average output tokens
Model response length.
Typical range: 200–800 chat; 1,000–4,000 long-form; 100–500 structured extraction.
Cache hit rate (if supported)
Share of input tokens served from prompt cache.
Typical range: 30–70% on assistants with stable system prompts; 0% if every request is unique.
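One way to combine the four inputs above into a monthly figure is sketched below. This is a minimal model, not the calculator's actual implementation: the `Pricing` class and `monthly_cost` function are hypothetical names, the example prices are made up, and cache-write surcharges and batch discounts are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per 1M tokens."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: Optional[float] = None  # None means no cache discount


def monthly_cost(requests: int, avg_input_tokens: float, avg_output_tokens: float,
                 pricing: Pricing, cache_hit_rate: float = 0.0) -> float:
    """Blended monthly cost in USD from the four calculator inputs."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached_rate = pricing.cached_input_per_m
    if cached_rate is None:
        cached_rate = pricing.input_per_m
    # Effective input price after mixing cached and uncached tokens.
    input_rate = ((1 - cache_hit_rate) * pricing.input_per_m
                  + cache_hit_rate * cached_rate)
    input_cost = requests * avg_input_tokens / 1_000_000 * input_rate
    output_cost = requests * avg_output_tokens / 1_000_000 * pricing.output_per_m
    return input_cost + output_cost


# Hypothetical workload and prices, chosen only to exercise the function.
example = Pricing(input_per_m=1.00, output_per_m=4.00, cached_input_per_m=0.10)
no_cache = monthly_cost(100_000, 2_000, 500, example)
half_cached = monthly_cost(100_000, 2_000, 500, example, cache_hit_rate=0.5)
print(round(no_cache, 2), round(half_cached, 2))
```

Here the cache hit rate is applied to input tokens only, matching the definition above; output tokens are always billed at the full output rate.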
Worked examples
Real scenarios with the math walked through line by line.
Chat product, GPT-5 mini
Scenario: 500,000 requests/mo, 1,500 input tokens, 600 output tokens, no cache. Pricing: $0.30 input / $2.40 output per 1M tokens.
Math: Input cost = 500k × 1,500 / 1M × $0.30 = $225. Output = 500k × 600 / 1M × $2.40 = $720. Total ≈ $945/mo.
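Outcome: Scaling is predictable, and output dominates as expected. At the GPT-5 standard rates quoted elsewhere in this guide ($1.25 input / $10 output per 1M tokens), the same workload would cost about $3,940/mo, roughly 4× this.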
RAG product, Claude Sonnet 4 with caching
Scenario: 200,000 requests, 12,000 input tokens (8,000 cacheable), 800 output. Pricing ≈ $3 input / $0.30 cached / $15 output per 1M.
Math: Cacheable input = 200k × 8,000 / 1M × $0.30 = $480. Non-cached = 200k × 4,000 / 1M × $3 = $2,400. Output = 200k × 800 / 1M × $15 = $2,400. Total ≈ $5,280.
Outcome: Without caching, same workload would cost ~$9,600 — caching cuts bill 45%.
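The RAG example can be reproduced in a few lines. The sketch uses the scenario's numbers as given and keeps its implicit assumptions: every cacheable token is a cache hit, and there is no extra charge for writing to the cache. The constant and function names are illustrative.

```python
REQUESTS = 200_000
INPUT_TOKENS = 12_000
CACHEABLE_TOKENS = 8_000
OUTPUT_TOKENS = 800
PRICE_INPUT, PRICE_CACHED, PRICE_OUTPUT = 3.00, 0.30, 15.00  # USD per 1M tokens


def usd(tokens_per_request: int, price_per_m: float) -> float:
    """Monthly cost of one token bucket across all requests."""
    return REQUESTS * tokens_per_request / 1_000_000 * price_per_m


output_cost = usd(OUTPUT_TOKENS, PRICE_OUTPUT)
with_cache = (usd(CACHEABLE_TOKENS, PRICE_CACHED)
              + usd(INPUT_TOKENS - CACHEABLE_TOKENS, PRICE_INPUT)
              + output_cost)
without_cache = usd(INPUT_TOKENS, PRICE_INPUT) + output_cost
savings = 1 - with_cache / without_cache

print(f"with cache:    ${with_cache:,.0f}")
print(f"without cache: ${without_cache:,.0f}")
print(f"savings:       {savings:.0%}")
```

A real cache hit rate below 100% on the cacheable portion would move the result toward the uncached figure, so the 45% here is best read as an upper bound for this workload.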
Common mistakes
Where this calculation usually goes wrong in the real world.
- Pricing on input only. Output tokens drive 60–80% of typical chat costs.
- Forgetting tool-calling round trips. Each tool call is a separate request with its own token billing.
- Comparing models on identical prompts. Different models need different prompt structures — cost-per-task is the real metric.
- Skipping prompt caching where available. 30–80% bill reduction on workloads with stable system prompts.
- Modeling at peak token count. Most chat sessions are far below max context.
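The tool-calling mistake in the list above is easy to underestimate, so here is a rough sketch of how input tokens accumulate. It assumes each round trip re-sends the full conversation so far and that every round adds a fixed number of tokens; real agents vary, and prompt caching would reduce the billed amount. The function name and the token counts are hypothetical.

```python
def agent_input_tokens(base_prompt_tokens: int,
                       tokens_added_per_round: int,
                       tool_rounds: int) -> int:
    """Total billed input tokens for one task with `tool_rounds` tool calls.

    Assumes every round trip re-sends the full conversation so far, and that
    each round adds a fixed number of tokens (tool call plus tool result).
    """
    total = 0
    for round_index in range(tool_rounds + 1):  # +1 for the final answer request
        total += base_prompt_tokens + round_index * tokens_added_per_round
    return total


single_shot = agent_input_tokens(2_000, 500, tool_rounds=0)
three_tools = agent_input_tokens(2_000, 500, tool_rounds=3)
print(single_shot, three_tools)  # 2000 11000
```

Under these assumptions, three tool calls turn a 2,000-token prompt into 11,000 billed input tokens, 5.5× the single-shot figure, before any output tokens are counted.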
When to use this calculator
- Estimating monthly bill before scaling a feature.
- Comparing GPT-5/Claude/Gemini for a specific workload.
- Modeling the savings from prompt caching or batching.
- Setting per-user cost budgets for a freemium AI feature.
Glossary
Token
Roughly ¾ of a word in English. Models bill by tokens, not characters or words.
Prompt cache
Server-side reuse of repeated prompt prefixes. Reduces input billing 50–90% on cached portions, depending on the provider.
Context window
Maximum tokens the model can process in a single request (input + output combined).
More questions answered
Is GPT-5 or Claude cheaper per task?
Depends on the task. GPT-5 standard runs $1.25 input / $10 output per million tokens; Claude Sonnet 4 is $3 / $15. For pure cost, GPT-5 mini and Claude Haiku undercut both at <$0.50 / $3. Choose on quality-per-dollar for your specific workload — run 100 representative prompts against 3 candidate models and grade outputs before choosing on price alone.
When should I use Gemini Flash vs Pro?
Flash for high-volume, latency-sensitive workloads (chat, classification, simple extraction) — typically 10–20× cheaper than Pro. Pro for tasks requiring strong reasoning, long-context analysis, or complex multimodal input. Many production systems route 80%+ of traffic to Flash and reserve Pro for the 10–20% of requests that fail Flash's quality bar.
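The routing pattern above can be sized with a back-of-the-envelope model. The sketch assumes every request is tried on Flash first, that escalated requests pay for both attempts, that token counts are the same on both models, and that the quality check itself is free. The function name is hypothetical.

```python
def routed_cost_vs_all_pro(price_ratio: float, escalation_rate: float) -> float:
    """Cost of a Flash-first router as a fraction of sending everything to Pro.

    price_ratio:      how many times cheaper Flash is than Pro per request.
    escalation_rate:  share of requests retried on Pro after failing on Flash.
    Assumes every request is tried on Flash first, so escalated requests
    pay for both attempts.
    """
    if price_ratio <= 0 or not 0 <= escalation_rate <= 1:
        raise ValueError("price_ratio must be > 0 and escalation_rate in [0, 1]")
    flash_share = 1 / price_ratio
    return flash_share + escalation_rate


for ratio in (10, 20):
    for escalation in (0.10, 0.20):
        fraction = routed_cost_vs_all_pro(ratio, escalation)
        print(f"Flash {ratio}x cheaper, {escalation:.0%} escalated: "
              f"{fraction:.0%} of all-Pro cost")
```

Across the ranges quoted above, the routed setup lands at roughly 15–30% of the all-Pro bill, and the escalation rate matters more than the exact price ratio.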
How much can prompt caching realistically save?
Workload-dependent: chatbots with stable system prompts see 50–70% cost reduction; RAG with reusable document context sees 40–60%; code agents with shared file context can hit 80%+. Anthropic and OpenAI both support prompt caching. For workloads above $1k/mo, the engineering cost to enable it is typically <1 day, which makes it the highest-ROI optimization.
Related guides
Long-form playbooks on the same topic, written by the RevenueLab editorial team.
LLM Token Costs in 2026: Pricing Every Model, Hidden Multipliers, and Margin Math
Input vs output token pricing across GPT, Claude, and Gemini, the context-window cost trap, how caching and batching cut bills 40–80%, and the real per-user margin most AI apps miss.
SaaS Pricing Strategy: Per-Seat, Usage, Tiers, and the Hybrid Future
A framework for choosing a SaaS pricing model — when per-seat caps your growth, when usage-based makes revenue volatile, and how hybrid models stitch the two together.
Revenue Model Glossary: 40 Terms Every Operator Should Know (CPM, RPM, ARPU, LTV, CAC, ROAS…)
Plain-English definitions for the 40 revenue-modeling terms that show up across creator, SaaS, ecommerce, and ads — with the exact formula, a worked example, and the most common misuse for each.
Methodology last reviewed: 2026-05 by the RevenueLab editorial team.
FAQ
How much does GPT-5 cost per request?
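At ~1,200 input / 400 output tokens (typical chat) and $1.25 input / $10 output per 1M tokens, GPT-5 costs about $0.0055 per request before caching: $0.0015 for input plus $0.004 for output. With 70% of input tokens served from cache, and assuming cached input is billed at 10% of the normal rate, that drops to about $0.0046. Caching does not touch output tokens, so $0.004 is the floor. GPT-5's input rate is roughly half that of GPT-4o.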
Is Claude or Gemini cheaper than GPT-5?
Gemini 2.5 Flash is the cheapest tier ($0.30 in / $2.50 out per 1M). Claude Sonnet 4.5 is the most expensive of the three ($3 / $15). GPT-5 sits in the middle ($1.25 / $10).
How do I lower my AI bill?
Three biggest levers: (1) enable prompt caching, (2) cap max_tokens hard, (3) use a cheaper model for routing/classification and the big model only for final output.
How this calculator is built
Independently maintained
Written by Sam Doshi and the RevenueLab editorial team. We don't sell the data feeds this tool is built on.
Sourced from primary data
Benchmarks come from public AdSense / Stripe / IRS disclosures and reader-submitted data — never third-party "$X per view" claims. Full methodology.
Last reviewed
June 2026. We re-check every figure on the platform on a rolling quarterly cycle.
Editorial standards
See our editorial policy and disclaimer. Results are estimates, not advice.