Four cost drivers ranked by impact
(1) LLM completion model: switching from GPT-4.1 to GPT-4o-mini cuts per-query cost about 5×, usually with a <10% drop in answer quality on grounded RAG questions. (2) Re-embed frequency: if less than 10% of your corpus changes per month, schedule incremental embedding instead of a full re-embed. (3) Vector DB tier: under 10M vectors, Pinecone serverless and pgvector are dramatically cheaper than dedicated Pinecone or Weaviate clusters. (4) Embedding model: text-embedding-3-small handles 80% of use cases at roughly 15% of text-embedding-3-large's price.
When pgvector beats Pinecone (and when it doesn't)
If you already run Postgres and your index is under 5M vectors with <100 QPS, pgvector with IVFFlat or HNSW indexing is free, fast enough, and operationally simpler. Pinecone serverless wins above 10M vectors, above 200 QPS, or when you need filtered queries with sub-50ms p99 latency. Qdrant Cloud and Weaviate Cloud sit between — pick by team familiarity, not benchmark micro-differences.
- pgvector: <5M vectors, <100 QPS, existing Postgres
- Pinecone serverless: bursty traffic, no infra team
- Pinecone pods: >50M vectors, sustained high QPS
- Qdrant / Weaviate: self-host control + cloud option
- Turbopuffer / Vespa: cost-optimized at scale (>100M vectors)
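The rules of thumb above can be written down as a small lookup. This is only a sketch: the function name `suggest_vector_store` and its parameters are hypothetical, the thresholds are the heuristics from this section rather than benchmark results, and the order of the checks is one reasonable reading of them.

```python
def suggest_vector_store(vectors: int, qps: float, has_postgres: bool,
                         bursty: bool = False) -> str:
    """Map this section's rules of thumb to a suggestion (hypothetical helper)."""
    if vectors > 100_000_000:
        return "Turbopuffer / Vespa"      # cost-optimized at very large scale
    if vectors > 50_000_000 and not bursty:
        return "Pinecone pods"            # large index, sustained high QPS
    if has_postgres and vectors < 5_000_000 and qps < 100:
        return "pgvector"                 # reuse the database you already run
    if vectors > 10_000_000 or qps > 200 or bursty:
        return "Pinecone serverless"
    return "Qdrant / Weaviate"            # the middle ground: pick by familiarity


print(suggest_vector_store(vectors=2_000_000, qps=20, has_postgres=True))
```

Treat the result as a starting point for a proof of concept, not a verdict; filtered-query latency and team familiarity are not modeled here.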
Where reranking pays for itself
Cohere Rerank 3 or Voyage rerank at ~$0.001/query typically lifts top-5 answer relevance 15–30%, which usually saves more in customer-support handle time or trust than the rerank costs. A 10k-query/day system pays $300/mo for rerank and often saves $2,000+/mo in misrouted tickets and bad-answer escalations.
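The arithmetic behind that example, as a minimal sketch. The function name is hypothetical, the ~$0.001/query price and the $2,000/mo savings figure are the illustrative numbers from the paragraph above, and a 30-day month is assumed.

```python
def rerank_monthly_cost(queries_per_day: int,
                        price_per_query: float = 0.001,
                        days_per_month: int = 30) -> float:
    """Monthly rerank spend in dollars (30-day month is an assumption)."""
    return queries_per_day * price_per_query * days_per_month


cost = rerank_monthly_cost(10_000)
estimated_savings = 2_000  # illustrative figure from the text, not measured data
print(f"rerank cost: ${cost:,.0f}/mo, net benefit: ${estimated_savings - cost:,.0f}/mo")
```

Plug in your own query volume and an honest estimate of escalation cost; reranking pays for itself only when the savings estimate exceeds the monthly rerank bill.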
Hidden cost: re-embedding cadence
Many teams accidentally re-embed the full corpus weekly via a 'safe' cron job. For a corpus of 1M documents averaging 2k tokens each (2B tokens), that's $40/week at text-embedding-3-small's $0.02/M tokens, or $2,080/year of avoidable cost. Use document checksums and embed only the deltas: the typical real change rate is 4–12% per month, not 100% per week.
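A minimal sketch of checksum-based delta detection, plus the cost arithmetic for this example. The helper names (`content_hash`, `changed_docs`, `embed_cost`) are hypothetical, and the $0.02 per million tokens default is the text-embedding-3-small price quoted elsewhere on this page.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable checksum of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_docs(docs: dict[str, str], seen: dict[str, str]) -> dict[str, str]:
    """Return {doc_id: new_hash} for documents that are new or modified.

    `seen` maps doc_id to the hash stored at the last successful embed.
    """
    changed = {}
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen.get(doc_id) != digest:
            changed[doc_id] = digest
    return changed


def embed_cost(num_docs: int, tokens_per_doc: int = 2_000,
               price_per_m_tokens: float = 0.02) -> float:
    """Embedding cost in dollars for one pass over `num_docs` documents."""
    return num_docs * tokens_per_doc / 1_000_000 * price_per_m_tokens


seen: dict[str, str] = {}
docs = {"a": "refund policy v1", "b": "shipping policy v1"}
delta = changed_docs(docs, seen)
# ... embed only the documents in `delta`, then record their hashes:
seen.update(delta)
```

Updating `seen` only after the vectors are stored means a failed run is retried next time instead of being silently skipped.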
FAQ
Should I use OpenAI or open-source embeddings?
Open-source (bge-large, nomic, jina) on self-hosted GPU is free per-token but adds infra. For corpora under 10M docs, OpenAI text-embedding-3-small at $0.02/M tokens is usually cheaper all-in than running your own GPU. Above that, self-host.
Do I need a vector database at all?
For <100k documents you can store embeddings as numpy arrays and brute-force search in memory. Vector DBs become necessary when you need filtered queries, sub-100ms latency, or persistence across restarts.
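For the small-corpus case, brute-force search is a few lines of NumPy. A sketch, assuming the embeddings are already loaded as a 2-D float array with one non-zero row per document; `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `matrix` most similar to `query`.

    Assumes k >= 1 and that no vector is all zeros.
    """
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q                             # cosine similarity per document
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    return top[np.argsort(-scores[top])]       # order best-first


embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(top_k_cosine(np.array([1.0, 0.1]), embeddings, k=2))
```

If you normalize the stored matrix once at load time, each query reduces to a single matrix-vector product.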
What's the right chunk size?
300–600 tokens with 50–100 token overlap is a solid default for narrative text. Code: chunk by function or class. Tables and structured docs: chunk by row groups. Test 3 chunk sizes on your eval set — chunking choice often beats embedding model choice.
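A sketch of fixed-size chunking with overlap for narrative text. The function name `chunk_tokens` and the defaults (400 tokens, 75 overlap) are illustrative choices within the ranges above, and the whitespace split in the usage example stands in for a real tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int = 400,
                 overlap: int = 75) -> list[list[str]]:
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


text = "word " * 1000
chunks = chunk_tokens(text.split())  # whitespace split is a stand-in tokenizer
print(len(chunks), [len(c) for c in chunks])
```

To compare chunk sizes on your eval set, call the function with three different `size` values and index each result separately.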
How do I reduce completion cost without losing quality?
Three levers: (1) shrink retrieved context — top-3 reranked chunks usually beat top-10 raw, (2) cache common Q&A pairs, (3) route easy queries to GPT-4o-mini and hard ones to GPT-4.1 via a cheap classifier.
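Levers (2) and (3) can be combined in one thin wrapper. This is a sketch under assumptions: `CachedRouter`, `normalize`, `is_hard`, and `call_model` are hypothetical names, `call_model(model, question)` stands in for whatever client you already use, and the exact-match cache is the simplest possible variant.

```python
from typing import Callable


def normalize(question: str) -> str:
    """Cheap cache key: lowercase and collapse whitespace."""
    return " ".join(question.lower().split())


class CachedRouter:
    """Serve repeats from a cache; route the rest by difficulty."""

    def __init__(self, is_hard: Callable[[str], bool],
                 cheap_model: str = "gpt-4o-mini",
                 strong_model: str = "gpt-4.1") -> None:
        self.is_hard = is_hard          # your cheap classifier goes here
        self.cheap_model = cheap_model
        self.strong_model = strong_model
        self.cache: dict[str, str] = {}

    def answer(self, question: str,
               call_model: Callable[[str, str], str]) -> str:
        key = normalize(question)
        if key in self.cache:
            return self.cache[key]      # no completion cost at all
        model = self.strong_model if self.is_hard(question) else self.cheap_model
        result = call_model(model, question)
        self.cache[key] = result
        return result


# Placeholder heuristic and fake client, for illustration only.
router = CachedRouter(is_hard=lambda q: len(q.split()) > 20)
print(router.answer("What is RAG?", lambda model, q: f"[{model}] answer"))
```

A real deployment would also expire cache entries when the underlying documents change, so stale answers are not served.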
Are embeddings billed if cached?
OpenAI does not cache embeddings — every call is billed. You're responsible for caching the embedding alongside the document. Embed once, store the vector, never recompute unless the source text changes.
What about hybrid search?
Combining BM25 (keyword) + vector similarity typically lifts recall 10–25% on technical or jargon-heavy corpora at minimal added cost. Weaviate, Qdrant, and OpenSearch ship hybrid out of the box.
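If your store does not offer hybrid search, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below illustrates that technique only; it is not a description of how the products named above implement hybrid search. The function name is hypothetical and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of document IDs into one list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]     # keyword ranking, best first
vector_hits = ["b", "d", "a"]   # embedding ranking, best first
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.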
Does GPT-4.1's long context replace RAG?
No — long-context models can fit 1M tokens but per-query cost scales linearly with context. For 100k+ doc corpora, RAG remains 50–500× cheaper per query than stuffing context. Use long-context for one-shot deep analysis, RAG for recurring search.
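A sketch of why the gap is so large, counting input tokens only. The $2.00 per million input tokens and the 4,000-token retrieved context are assumptions chosen for illustration; the ratio does not depend on the price as long as both approaches use the same model.

```python
def input_cost(tokens: int, price_per_m_tokens: float) -> float:
    """Input-side cost in dollars for one query."""
    return tokens / 1_000_000 * price_per_m_tokens


PRICE = 2.00                  # assumed $/M input tokens, for illustration only
stuffed_tokens = 1_000_000    # whole corpus stuffed into the context window
rag_tokens = 4_000            # a few retrieved chunks plus the prompt (assumed)

stuffed = input_cost(stuffed_tokens, PRICE)
rag = input_cost(rag_tokens, PRICE)
print(f"stuffed: ${stuffed:.2f}/query, RAG: ${rag:.4f}/query, ratio: {stuffed / rag:.0f}x")
```

Output tokens, retrieval, and reranking costs are left out; they narrow the gap somewhat but do not change its order of magnitude.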
How do I budget for a RAG POC?
Most POCs run $50–$300/month for the first 3 months: $0 vector DB (pgvector or free Pinecone tier), $10–50 in embeddings, $40–250 in completions. Production scaling is where the real cost decisions hit.
How this calculator is built
Independently maintained
Written by Sam Doshi and the RevenueLab editorial team. We don't sell the data feeds this tool is built on.
Sourced from primary data
Benchmarks come from public AdSense / Stripe / IRS disclosures and reader-submitted data — never third-party "$X per view" claims. Full methodology.
Last reviewed
June 2026. We re-check every figure on the platform on a rolling quarterly cycle.
Editorial standards
See our editorial policy and disclaimer. Results are estimates, not advice.