RAG Pipeline Cost Calculator

Estimate monthly RAG (retrieval-augmented generation) spend across embeddings, vector database, retrieval queries, reranking, and LLM completion. Pinecone, Qdrant, Weaviate, pgvector compared.

Disclaimer: Educational estimate. Provider pricing changes frequently and varies by region, commitment tier, cache hit rate, and model version. Always confirm current pricing on the provider's pricing page before budgeting at scale.

Documents in corpus: 50,000

Avg tokens per document: 1,200

Embedding cost ($/M tokens): $0.02

text-embedding-3-small: $0.02 · 3-large: $0.13 · Cohere v3: $0.10

Vector DB monthly cost ($): $70.00

Pinecone s1.x1 ~$70 for 5M vectors · Qdrant Cloud: ~$50 · pgvector on existing PG: ~$0

User queries per day: 2,000

LLM completion cost per query ($): $0.01

GPT-4o-mini ~$0.003 · GPT-4.1 ~$0.015 · Claude 3.5 Sonnet ~$0.012

Corpus % re-embedded per month: 8%

How much of your corpus changes monthly.

Formula used

RAG monthly spend

RAG cost has four buckets: (1) initial embedding to populate the index, (2) ongoing re-embedding as documents change, (3) vector DB hosting, and (4) per-query LLM completion. The first is a one-time cost; the other three recur every month. In most production systems, completion cost is 70–90% of monthly spend. Vector DB hosting is a much smaller line item but the most over-provisioned: teams routinely pay 3–5× what they need.

Embed$ = (Docs × Tokens ÷ 1M) × $/M

Monthly = VectorDB$ + Embed$ × Refresh% + Queries × Days × $/query
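The formula can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the calculator's actual source: the function and parameter names are made up here, the defaults mirror the inputs shown on this page, and a 30-day month is an assumption, since the page does not say what value Days takes.

```python
def initial_embed_cost(docs: int, tokens_per_doc: int, embed_cost_per_m: float) -> float:
    """One-time cost to embed the whole corpus (Embed$ in the formula)."""
    return docs * tokens_per_doc / 1_000_000 * embed_cost_per_m


def monthly_cost(
    docs: int = 50_000,
    tokens_per_doc: int = 1_200,
    embed_cost_per_m: float = 0.02,           # $ per 1M embedding tokens
    vector_db_monthly: float = 70.0,          # $ per month
    queries_per_day: int = 2_000,
    completion_cost_per_query: float = 0.01,  # $ per query
    refresh_pct_monthly: float = 8.0,         # % of corpus re-embedded per month
    days_per_month: int = 30,                 # assumption: not stated on the page
) -> dict:
    """Return each recurring monthly line item plus the total, in dollars."""
    embed = initial_embed_cost(docs, tokens_per_doc, embed_cost_per_m)
    re_embedding = embed * refresh_pct_monthly / 100
    completion = queries_per_day * days_per_month * completion_cost_per_query
    return {
        "vector_db": vector_db_monthly,
        "re_embedding": re_embedding,
        "completion": completion,
        "total": vector_db_monthly + re_embedding + completion,
    }


if __name__ == "__main__":
    for name, dollars in monthly_cost().items():
        print(f"{name:>12}: ${dollars:,.2f}")
```

With the default inputs, the initial embedding pass costs about $1.20 and the monthly total comes to about $670, of which completion is roughly 90%.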

OpenAI 3-small (default): $0.02/M tokens

OpenAI 3-large (high recall): $0.13/M tokens

Pinecone serverless: Pay-per-query

pgvector on existing PG: ~$0 add-on

Cite this calculator

Writing about this topic? Grab a citation — every link helps keep these tools free.

APA

RevenueLab. (2026). RAG Pipeline Cost Calculator. Retrieved from https://revenuelab.fyi/rag-pipeline-cost-calculator

HTML

<p>Source: <a href="https://revenuelab.fyi/rag-pipeline-cost-calculator" target="_blank" rel="noopener">RAG Pipeline Cost Calculator — RevenueLab</a> (2026).</p>

Markdown

Source: [RAG Pipeline Cost Calculator — RevenueLab](https://revenuelab.fyi/rag-pipeline-cost-calculator) (2026).

Four cost drivers ranked by impact

(1) LLM completion model: switching from GPT-4.1 to GPT-4o-mini cuts per-query cost about 5×, usually with a <10% drop in answer quality on grounded RAG questions. (2) Re-embed frequency: if your corpus changes <10% monthly, schedule incremental embedding of changed documents, not a full re-embed. (3) Vector DB tier: under 10M vectors, Pinecone serverless and pgvector are dramatically cheaper than dedicated Pinecone or Weaviate clusters. (4) Embedding model: 3-small handles 80% of use cases at about 15% of 3-large's price ($0.02 vs $0.13 per M tokens).

When pgvector beats Pinecone (and when it doesn't)

If you already run Postgres and your index is under 5M vectors with <100 QPS, pgvector with IVFFlat or HNSW indexing adds no hosting cost, is fast enough, and is operationally simpler. Pinecone serverless wins above 10M vectors, above 200 QPS, or when you need filtered queries with sub-50ms p99 latency. Qdrant Cloud and Weaviate Cloud sit in between; pick by team familiarity, not benchmark micro-differences.

• pgvector: <5M vectors, <100 QPS, existing Postgres
• Pinecone serverless: bursty traffic, no infra team
• Pinecone pods: >50M vectors, sustained high QPS
• Qdrant / Weaviate: self-host control + cloud option
• Turbopuffer / Vespa: cost-optimized at scale (>100M vectors)

Where reranking pays for itself

Cohere Rerank 3 or Voyage rerank at ~$0.001/query typically lifts top-5 answer relevance 15–30%, which usually saves more in customer-support handle time or trust than reranking costs. A 10k-query/day system pays about $300/mo for rerank (10,000 queries × 30 days × $0.001) and often saves $2,000+/mo in misrouted tickets and bad-answer escalations.
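The break-even arithmetic above can be checked with a small sketch. The function names are illustrative, the 30-day month is an assumption, and the savings figure is whatever you estimate for your own support workload; the $2,000 used here is just the example from the paragraph above.

```python
def rerank_monthly_cost(
    queries_per_day: int,
    cost_per_query: float = 0.001,  # ~$0.001/query, as quoted above
    days_per_month: int = 30,       # assumption
) -> float:
    """Monthly rerank spend in dollars."""
    return queries_per_day * days_per_month * cost_per_query


def rerank_net_benefit(queries_per_day: int, monthly_savings: float) -> float:
    """Estimated savings minus rerank spend; positive means rerank pays for itself."""
    return monthly_savings - rerank_monthly_cost(queries_per_day)


if __name__ == "__main__":
    print(f"rerank cost: ${rerank_monthly_cost(10_000):,.2f}/mo")
    print(f"net benefit: ${rerank_net_benefit(10_000, 2_000):,.2f}/mo")
```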

Hidden cost: re-embedding cadence

Many teams accidentally re-embed the full corpus weekly via a 'safe' cron job. For a 1M-document, 2k-token corpus at $0.02/M tokens, that's 2B tokens and $40 per run, or $2,080/year of avoidable cost. Use document checksums and embed only the deltas: the typical real change rate is 4–12% per month, not 100% per week.
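One way to implement the checksum idea is sketched below, assuming you keep a mapping from document ID to the hash of the text that was last embedded. The function names and the in-memory dict are hypothetical; in practice the hashes would live next to the vectors in your database.

```python
import hashlib


def checksum(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_embed(docs: dict[str, str], stored: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose text changed.

    docs   maps document ID -> current text
    stored maps document ID -> checksum recorded at the last embedding run
    """
    return [doc_id for doc_id, text in docs.items() if stored.get(doc_id) != checksum(text)]


if __name__ == "__main__":
    stored = {"a": checksum("unchanged text"), "b": checksum("old text")}
    docs = {"a": "unchanged text", "b": "new text", "c": "brand new doc"}
    changed = docs_to_embed(docs, stored)
    print(changed)  # only these IDs need an embedding call
    # After embedding, record the new checksums so the next run skips them.
    stored.update({doc_id: checksum(docs[doc_id]) for doc_id in changed})
```

Documents deleted from the source would still need to be removed from the index separately; this sketch only decides what to embed.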

FAQ

Should I use OpenAI or open-source embeddings?

Open-source models (bge-large, nomic, jina) on a self-hosted GPU carry no per-token fee but add infrastructure cost. For corpora under 10M docs, OpenAI text-embedding-3-small at $0.02/M tokens is usually cheaper all-in than running your own GPU. Above that, self-hosting tends to win.

Do I need a vector database at all?

For <100k documents you can store embeddings as numpy arrays and brute-force search in memory. Vector DBs become necessary when you need filtered queries, sub-100ms latency, or persistence across restarts.
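A minimal sketch of the brute-force approach, assuming NumPy is installed and that embeddings are stored as rows of a 2-D array. The function name is illustrative, and cosine similarity is an assumption here; use whichever metric your embedding model recommends.

```python
import numpy as np


def top_k(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Brute-force cosine-similarity search.

    matrix has one embedding per row; returns (row_index, score) pairs,
    best match first.
    """
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q                    # cosine similarity against every row
    best = np.argsort(-scores)[:k]    # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in best]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1_000, 64))  # stand-in for real embeddings
    query = embeddings[42] + 0.01 * rng.normal(size=64)
    print(top_k(query, embeddings, k=3))
```

The array can be written to disk with `np.save` and reloaded with `np.load`, which covers simple persistence for small corpora.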

What's the right chunk size?

300–600 tokens with 50–100 token overlap is a solid default for narrative text. Code: chunk by function or class. Tables and structured docs: chunk by row groups. Test 3 chunk sizes on your eval set — chunking choice often beats embedding model choice.
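A sliding-window chunker for narrative text might look like the sketch below. It takes a pre-tokenized list; the demo splits on whitespace as a rough stand-in for a real tokenizer, which is an assumption made only to keep the example self-contained. The function name and defaults are illustrative.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 75) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    text = "word " * 1_000
    tokens = text.split()  # crude approximation of model tokens
    for size in (300, 450, 600):  # try several sizes against your eval set
        print(size, len(chunk_tokens(tokens, size=size, overlap=75)))
```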

How do I reduce completion cost without losing quality?

Three levers: (1) shrink retrieved context — top-3 reranked chunks usually beat top-10 raw, (2) cache common Q&A pairs, (3) route easy queries to GPT-4o-mini and hard ones to GPT-4.1 via a cheap classifier.
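Levers (2) and (3) can be combined in one small wrapper, sketched below. Everything here is hypothetical scaffolding: `is_hard` stands in for whatever cheap classifier you train, `call_llm` stands in for your own client code, and the model names are used only as routing labels.

```python
from typing import Callable

CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4.1"


def answer(
    question: str,
    cache: dict[str, str],
    is_hard: Callable[[str], bool],        # hypothetical cheap classifier
    call_llm: Callable[[str, str], str],   # hypothetical (model, question) -> answer
) -> str:
    """Serve from cache when possible; otherwise route by difficulty and cache the result."""
    key = " ".join(question.lower().split())  # normalize case and whitespace
    if key in cache:
        return cache[key]
    model = STRONG_MODEL if is_hard(question) else CHEAP_MODEL
    cache[key] = call_llm(model, question)
    return cache[key]


if __name__ == "__main__":
    def fake_llm(model: str, question: str) -> str:
        return f"[{model}] answer to: {question}"

    def long_question(q: str) -> bool:
        return len(q.split()) > 12  # placeholder heuristic, not a real classifier

    cache: dict[str, str] = {}
    print(answer("What is our refund policy?", cache, long_question, fake_llm))
    print(answer("what is our  refund policy?", cache, long_question, fake_llm))  # cache hit
```

Exact-match caching only catches repeated phrasings; matching paraphrases would need a similarity lookup, which this sketch does not attempt.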

Are embeddings billed if cached?

OpenAI does not cache embeddings — every call is billed. You're responsible for caching the embedding alongside the document. Embed once, store the vector, never recompute unless the source text changes.

What about hybrid search?

Combining BM25 (keyword) + vector similarity typically lifts recall 10–25% on technical or jargon-heavy corpora at minimal added cost. Weaviate, Qdrant, and OpenSearch ship hybrid out of the box.
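The text does not say how the two result lists are merged. One common choice is reciprocal rank fusion (RRF), sketched here as an illustration rather than as what any particular database does internally. The function name is made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["doc-a", "doc-b", "doc-c"]    # keyword ranking
    vector_hits = ["doc-b", "doc-d", "doc-a"]  # embedding ranking
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```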

Does GPT-4.1's long context replace RAG?

No — long-context models can fit 1M tokens but per-query cost scales linearly with context. For 100k+ doc corpora, RAG remains 50–500× cheaper per query than stuffing context. Use long-context for one-shot deep analysis, RAG for recurring search.
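Because input cost scales linearly with context, the gap is simply the ratio of tokens sent per query. The sketch below makes that explicit; the $2.00 per million input tokens is a placeholder rather than a quoted price, and the token counts are illustrative. Output tokens are assumed equal in both cases and are left out.

```python
def input_cost(tokens: int, price_per_m_tokens: float) -> float:
    """Dollar cost of sending `tokens` input tokens in one query."""
    return tokens / 1_000_000 * price_per_m_tokens


def stuffing_vs_rag_ratio(stuffed_tokens: int, retrieved_tokens: int, price_per_m_tokens: float) -> float:
    """How many times more one context-stuffed query costs than one RAG query."""
    return input_cost(stuffed_tokens, price_per_m_tokens) / input_cost(retrieved_tokens, price_per_m_tokens)


if __name__ == "__main__":
    price = 2.00  # hypothetical $ per 1M input tokens
    ratio = stuffing_vs_rag_ratio(1_000_000, 5_000, price)
    print(f"stuffed: ${input_cost(1_000_000, price):.2f} per query")
    print(f"RAG:     ${input_cost(5_000, price):.4f} per query")
    print(f"ratio:   {ratio:.0f}x")
```

The price cancels out of the ratio, so the comparison holds for any model billed per input token: 1M stuffed tokens against 5,000 retrieved tokens is a 200× difference.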

How do I budget for a RAG POC?

Most POCs run $50–$300/month for the first 3 months: $0 vector DB (pgvector or free Pinecone tier), $10–50 in embeddings, $40–250 in completions. Production scaling is where the real cost decisions hit.

How this calculator is built

Independently maintained

Written by Sam Doshi and the RevenueLab editorial team. We don't sell the data feeds this tool is built on.

Sourced from primary data

Benchmarks come from public AdSense / Stripe / IRS disclosures and reader-submitted data — never third-party "$X per view" claims. Full methodology.

Last reviewed

June 2026. We re-check every figure on the platform on a rolling quarterly cycle.

Editorial standards

See our editorial policy and disclaimer. Results are estimates, not advice.