A CI pipeline that catches prompt regressions before new model versions reach production.
A Vitest test suite where each test case is a prompt + expected output pattern. Tests run against the current model version and check: does the output match the expected format (Zod validation)? Does it contain required information? Does the tone match (scored by a separate Claude call)? The suite runs in GitHub Actions on every PR that touches prompt files.
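Stripped down, the format and required-information checks in one test case might look like this sketch. The schema fields (`score`, `reason`, `followUp`) and the hand-rolled shape check are illustrative assumptions standing in for the suite's real Zod schemas (`schema.safeParse`) and Vitest assertions; the tone-scoring Claude call is omitted:

```typescript
// One regression check: does a captured model response match the
// expected output shape, and does it mention the required facts?
// Field names here are hypothetical, not the project's real schema.

type LeadQualifierOutput = {
  score: number;     // 0-100 lead score
  reason: string;    // short justification
  followUp: boolean; // whether a human should follow up
};

// Hand-rolled stand-in for zodSchema.safeParse(JSON.parse(raw)).
function matchesExpectedFormat(raw: string): LeadQualifierOutput | null {
  try {
    const parsed = JSON.parse(raw);
    if (
      typeof parsed.score === "number" &&
      parsed.score >= 0 && parsed.score <= 100 &&
      typeof parsed.reason === "string" &&
      typeof parsed.followUp === "boolean"
    ) {
      return parsed as LeadQualifierOutput;
    }
  } catch {
    // invalid JSON falls through and fails the check
  }
  return null;
}

// Required-information check: every expected keyword must appear.
function containsRequired(text: string, required: string[]): boolean {
  const lower = text.toLowerCase();
  return required.every((kw) => lower.includes(kw.toLowerCase()));
}

const response =
  '{"score": 82, "reason": "Budget confirmed and timeline is Q3", "followUp": true}';

const parsed = matchesExpectedFormat(response);
console.log(parsed !== null); // true: format matches
console.log(parsed !== null && containsRequired(parsed.reason, ["budget", "timeline"])); // true
```

In the real suite these assertions sit inside a Vitest `test()` block, so a failing check surfaces as a named failed test in the GitHub Actions run rather than a console line.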
Prototyping with 50 test cases across BRVO's own prompts (chat, audit, lead qualifier). Testing whether model version upgrades (e.g., Sonnet 3.5 → Sonnet 4) break existing behaviour. Early finding: 8 out of 50 test cases failed when switching from Sonnet 3.5 to Sonnet 4 — all were tone-related (Sonnet 4 is more formal by default). This caught issues before they reached production.
Being developed to protect BRVO clients from model update surprises. When Anthropic or OpenAI releases a new model version, BRVO runs the regression suite against it before switching any client over. No more "the chatbot sounds different after the update" complaints.
A company relying on Claude for customer-facing chat. When Claude updates, the regression suite runs 200 test conversations and flags any where the tone, accuracy, or format has changed.
A marketing tool generating social media posts. The suite checks that output length, hashtag usage, and brand voice haven't shifted after a model update.
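The length and hashtag checks for that scenario are mechanical enough to sketch directly; the bounds below (40-280 characters, 1-3 hashtags) are illustrative assumptions, and brand-voice scoring, which would go through a separate model call, is left out:

```typescript
// Drift checks for a generated social post: length bounds and hashtag
// count. Thresholds are hypothetical examples, not the product's real ones.

function countHashtags(post: string): number {
  return (post.match(/#\w+/g) || []).length;
}

function withinLengthBounds(post: string, min: number, max: number): boolean {
  return post.length >= min && post.length <= max;
}

// Returns a list of failed checks; an empty array means no drift detected.
function checkPost(post: string): string[] {
  const failures: string[] = [];
  if (!withinLengthBounds(post, 40, 280)) failures.push("length out of bounds");
  const tags = countHashtags(post);
  if (tags < 1 || tags > 3) failures.push(`hashtag count ${tags} outside 1-3`);
  return failures;
}

const post =
  "New drop: our spring line is live. Lightweight, local, loud. #SpringLine #ShopLocal";
console.log(checkPost(post)); // [] — passes both checks
```

Run before and after a model switch, any post that goes from an empty failure list to a non-empty one is flagged for review.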
A company using AI for document summarisation. The suite verifies that summaries still capture key points, maintain the right length, and don't introduce hallucinations.
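For the summarisation case, a cheap first-pass hallucination proxy is to flag content words in the summary that never appear in the source. This token-overlap version is an illustrative assumption, with a hypothetical stop-word list; a production check would add a semantic judge (e.g. an LLM-as-judge call) on top:

```typescript
// Crude hallucination proxy: words in the summary that the source never
// mentions. Catches blatant fabrications only; paraphrases need a
// semantic check. The stop-word list is a minimal illustrative sample.

function contentWords(text: string): Set<string> {
  const stop = new Set([
    "the", "a", "an", "and", "or", "of", "in", "to", "is", "are",
    "with", "while", "for",
  ]);
  const words = text.toLowerCase().match(/[a-z]{3,}/g) || [];
  return new Set(words.filter((w) => !stop.has(w)));
}

// Content words present in the summary but absent from the source.
function novelWords(source: string, summary: string): string[] {
  const sourceWords = contentWords(source);
  return Array.from(contentWords(summary)).filter((w) => !sourceWords.has(w));
}

const source =
  "The quarterly report shows revenue grew 12% while costs stayed flat.";
const summary = "Revenue grew 12% while costs stayed flat.";

console.log(novelWords(source, summary)); // [] — nothing unsupported
console.log(summary.length / source.length < 0.8); // true — length check passes
```

The same before/after comparison applies: a summary that was clean on the old model but introduces novel words on the new one gets flagged, alongside the length-ratio check shown on the last line.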