A CI pipeline that catches prompt regressions before new model versions reach production.
A Vitest test suite where each test case is a prompt + expected output pattern. Tests run against the current model version and check: does the output match the expected format (Zod validation)? Does it contain required information? Does the tone match (scored by a separate Claude call)? The suite runs in GitHub Actions on every PR that touches prompt files.
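Stripped down, the format and required-information checks in one test case might look like this sketch. The schema fields (`score`, `reason`, `followUp`) and the hand-rolled shape check are illustrative assumptions standing in for the suite's real Zod schemas (`schema.safeParse`) and Vitest assertions; the tone-scoring Claude call is omitted:

```typescript
// One regression check: does a captured model response match the
// expected output shape, and does it mention the required facts?
// Field names here are hypothetical, not the project's real schema.

type LeadQualifierOutput = {
  score: number;     // 0-100 lead score
  reason: string;    // short justification
  followUp: boolean; // whether a human should follow up
};

// Hand-rolled stand-in for zodSchema.safeParse(JSON.parse(raw)).
function matchesExpectedFormat(raw: string): LeadQualifierOutput | null {
  try {
    const parsed = JSON.parse(raw);
    if (
      typeof parsed.score === "number" &&
      parsed.score >= 0 && parsed.score <= 100 &&
      typeof parsed.reason === "string" &&
      typeof parsed.followUp === "boolean"
    ) {
      return parsed as LeadQualifierOutput;
    }
  } catch {
    // invalid JSON falls through and fails the check
  }
  return null;
}

// Required-information check: every expected keyword must appear.
function containsRequired(text: string, required: string[]): boolean {
  const lower = text.toLowerCase();
  return required.every((kw) => lower.includes(kw.toLowerCase()));
}

const response =
  '{"score": 82, "reason": "Budget confirmed and timeline is Q3", "followUp": true}';

const parsed = matchesExpectedFormat(response);
console.log(parsed !== null); // true: format matches
console.log(parsed !== null && containsRequired(parsed.reason, ["budget", "timeline"])); // true
```

In the real suite these assertions sit inside a Vitest `test()` block, so a failing check surfaces as a named failed test in the GitHub Actions run rather than a console line.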
Prototyping with 50 test cases across BRVO's own prompts (chat, audit, lead qualifier). Testing whether model version upgrades (e.g., Sonnet 3.5 → Sonnet 4) break existing behaviour. Early finding: 8 out of 50 test cases failed when switching from Sonnet 3.5 to Sonnet 4 — all were tone-related (Sonnet 4 is more formal by default). This caught issues before they reached production.
Being developed to protect BRVO clients from model update surprises. When Anthropic or OpenAI releases a new model version, BRVO runs the regression suite against it before switching any client over. No more "the chatbot sounds different after the update" complaints.
A company relying on Claude for customer-facing chat. When Claude updates, the regression suite runs 200 test conversations and flags any where the tone, accuracy, or format has changed.
A marketing tool generating social media posts. The suite checks that output length, hashtag usage, and brand voice haven't shifted after a model update.
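The length and hashtag checks for that scenario are mechanical enough to sketch directly; the bounds below (40-280 characters, 1-3 hashtags) are illustrative assumptions, and brand-voice scoring, which would go through a separate model call, is left out:

```typescript
// Drift checks for a generated social post: length bounds and hashtag
// count. Thresholds are hypothetical examples, not the product's real ones.

function countHashtags(post: string): number {
  return (post.match(/#\w+/g) || []).length;
}

function withinLengthBounds(post: string, min: number, max: number): boolean {
  return post.length >= min && post.length <= max;
}

// Returns a list of failed checks; an empty array means no drift detected.
function checkPost(post: string): string[] {
  const failures: string[] = [];
  if (!withinLengthBounds(post, 40, 280)) failures.push("length out of bounds");
  const tags = countHashtags(post);
  if (tags < 1 || tags > 3) failures.push(`hashtag count ${tags} outside 1-3`);
  return failures;
}

const post =
  "New drop: our spring line is live. Lightweight, local, loud. #SpringLine #ShopLocal";
console.log(checkPost(post)); // [] — passes both checks
```

Run before and after a model switch, any post that goes from an empty failure list to a non-empty one is flagged for review.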
A company using AI for document summarisation. The suite verifies that summaries still capture key points, maintain the right length, and don't introduce hallucinations.
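For the summarisation case, a cheap first-pass hallucination proxy is to flag content words in the summary that never appear in the source. This token-overlap version is an illustrative assumption, with a hypothetical stop-word list; a production check would add a semantic judge (e.g. an LLM-as-judge call) on top:

```typescript
// Crude hallucination proxy: words in the summary that the source never
// mentions. Catches blatant fabrications only; paraphrases need a
// semantic check. The stop-word list is a minimal illustrative sample.

function contentWords(text: string): Set<string> {
  const stop = new Set([
    "the", "a", "an", "and", "or", "of", "in", "to", "is", "are",
    "with", "while", "for",
  ]);
  const words = text.toLowerCase().match(/[a-z]{3,}/g) || [];
  return new Set(words.filter((w) => !stop.has(w)));
}

// Content words present in the summary but absent from the source.
function novelWords(source: string, summary: string): string[] {
  const sourceWords = contentWords(source);
  return Array.from(contentWords(summary)).filter((w) => !sourceWords.has(w));
}

const source =
  "The quarterly report shows revenue grew 12% while costs stayed flat.";
const summary = "Revenue grew 12% while costs stayed flat.";

console.log(novelWords(source, summary)); // [] — nothing unsupported
console.log(summary.length / source.length < 0.8); // true — length check passes
```

The same before/after comparison applies: a summary that was clean on the old model but introduces novel words on the new one gets flagged, alongside the length-ratio check shown on the last line.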