RAG Accuracy Scoring

Status: Live
Stack: Supabase Vector, Claude API, pgvector

Automated pipeline measuring how accurately chatbots retrieve and cite the right information from client knowledge bases.

The pipeline ingests a client's knowledge base (PDFs, website content, product data), chunks it using four different strategies (fixed-size, semantic, sentence-level, and recursive), embeds each variant into Supabase pgvector, then runs 50 test questions per knowledge base. Claude scores each answer on retrieval precision (did it find the right chunks?) and faithfulness (does the answer match what the chunks actually say?).
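The four strategies differ mainly in how they decide chunk boundaries. Below is a minimal sketch of two of them (fixed-size with overlap, and recursive separator-based splitting); the function names, window sizes, and separator list are illustrative assumptions, not BRVO's production code:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking: slide a character window of `size` with `overlap`."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks


def chunk_recursive(text: str, size: int = 200,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Recursive chunking: split on the coarsest separator available,
    recursing into any piece that is still larger than `size`."""
    if len(text) <= size:
        return [text] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = buf + sep + part if buf else part
                if len(candidate) <= size:
                    buf = candidate
                else:
                    chunks.extend(chunk_recursive(buf, size, separators))
                    buf = part
            chunks.extend(chunk_recursive(buf, size, separators))
            return chunks
    # No separator present at all: fall back to a hard fixed-size split.
    return chunk_fixed(text, size, overlap=0)
```

Recursive splitting tends to keep paragraphs and sentences intact, which is consistent with it winning on long-form content, while fixed-size windows can cut mid-sentence.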


Running on 3 active client knowledge bases: a restaurant menu (420 items with allergen data), a travel agency destination guide (180 pages), and a SaaS product documentation set (90 articles). Semantic chunking consistently outperforms fixed-size by 23% on retrieval precision. Recursive chunking wins on long-form content. Current accuracy: 94.2% faithfulness score across all three bases.
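A strategy comparison like the 23% gap above comes down to a per-question retrieval-precision metric: the fraction of retrieved chunks that appear in a human-verified relevant set, averaged over the test questions. A hypothetical sketch of that calculation; the identifiers and data shapes are assumptions for illustration:

```python
def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant (precision@k)."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)


def score_strategy(results: dict[str, tuple[list[str], set[str]]]) -> float:
    """Average retrieval precision over a strategy's test questions.

    `results` maps question id -> (retrieved chunk ids, relevant chunk ids).
    """
    scores = [retrieval_precision(r, g) for r, g in results.values()]
    return sum(scores) / len(scores)
```

Running the same 50 questions through each chunking variant and comparing the averaged scores is what lets one strategy be declared the winner per knowledge base.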

Every RAG chatbot BRVO builds goes through this scoring pipeline before launch. The Ember Kitchen chatbot achieved 96.8% accuracy on allergen questions after two rounds of chunk optimisation. This is how BRVO guarantees chatbot quality — not by hoping the AI gets it right, but by measuring it.

E-commerce product assistant

A chatbot answering questions about 5,000 products. The scoring pipeline ensures it recommends the right product 94%+ of the time, not a similar-sounding one that's out of stock.

Legal document Q&A

A law firm's internal tool where accuracy isn't optional. The pipeline catches hallucinations before they reach a solicitor — every answer is grounded in the actual document.

Internal knowledge base

A 200-person company with scattered documentation across Notion, Google Drive, and Slack. The chatbot finds the right answer regardless of where it lives, scored against human-verified test questions.

Want this for your business?

Start a sprint