Two-Phase Automated Product-Validation Pipeline

~58%

of historical clarification cycles automatically catchable (pilot, 7 tickets)

0

false positives across 4 shipped bundles after calibration

~$0.75

and ~5 min per AI validation run

The problem

Onboarding or updating an insurance product depended on a manual, error-prone handoff: developers interpreted raw insurer rate sheets and benefit tables by hand, wrote test cases, and spot-checked rates against the quoting engine. Data problems surfaced late, in mid-development, after engineering effort had already been spent. Deterministic extractors existed for only ~3–4 carriers.

The approach

Designed a layered, two-phase validation architecture so AI improves coverage and intake speed while deterministic execution against the live engine remains the final authority.
Phase 1 (developer-stage automation): extracted data from raw insurer files (Poppler for PDFs, openpyxl for Excel workbooks), generated Playwright tests, ran them against the real quoting engine, and versioned the test-case CSVs in S3 as the contract between dev and QA, managed through a dashboard with RBAC and an audit trail.
Phase 2 (a shift-left IPM gate): a headless Claude CLI acts as first-pass reviewer, running a 5-checkpoint review over raw insurer files (with image- and checkbox-aware PDF reading) and producing a plain-language report, a four-level root-cause classification, and a generated test-case CSV, followed by a gated 'Dev Ready' handoff into Jira.
Positioned the LLM strictly as a first-pass reviewer, never the final authority — deterministic Playwright execution stays the source of truth.
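The "first-pass reviewer, never the final authority" rule can be sketched as a small decision function. This is a minimal illustration rather than the pipeline's actual code: the names `Verdict`, `ReviewFinding`, `ExecutionResult`, and `decide` are hypothetical, and combining reviewer findings with execution results in a single function is an assumption made for clarity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    DEV_READY = "dev_ready"
    NEEDS_CLARIFICATION = "needs_clarification"
    FAILED_EXECUTION = "failed_execution"


@dataclass(frozen=True)
class ReviewFinding:
    """One item from the LLM first-pass report (hypothetical shape)."""
    checkpoint: str
    blocking: bool
    message: str


@dataclass(frozen=True)
class ExecutionResult:
    """One deterministic test case executed against the quoting engine."""
    case_id: str
    expected_rate: str  # kept as strings so the comparison is exact
    actual_rate: str

    @property
    def passed(self) -> bool:
        return self.expected_rate == self.actual_rate


def decide(findings: Sequence[ReviewFinding],
           results: Sequence[ExecutionResult]) -> Verdict:
    # Deterministic execution is checked first; no reviewer output can override it.
    if any(not r.passed for r in results):
        return Verdict.FAILED_EXECUTION
    # The reviewer can only hold a ticket back for a human. With no executed
    # cases there is no evidence either way, so the ticket is also held.
    if not results or any(f.blocking for f in findings):
        return Verdict.NEEDS_CLARIFICATION
    return Verdict.DEV_READY
```

In this sketch the reviewer's output can delay a handoff but can never produce a pass on its own: `DEV_READY` requires at least one executed case and no failing ones.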

The outcome

Removed the per-carrier tooling bottleneck: new carrier documents can be validated on day one, without building bespoke deterministic extractors first.
Retroactive pilot across 7 production tickets found that up to ~58% of historical dev↔PM clarification cycles could have been caught automatically (baseline: 3.7 queries per ticket).
Validated bit-exact rate derivation on real insurer data; 0 false positives across 4 already-shipped bundles while catching 6 of 6 seeded defects; model/prompt versions pinned per run for auditability.
Hardened for horizontal scale: shared BullMQ/Redis queue (exactly-once execution across a load-balanced fleet), per-product idempotency, and surgical cross-instance job cancellation — verified live with zero duplicate or orphaned processes.
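Per-product idempotency is often implemented by deriving a deterministic job ID from the product and bundle version, so that a duplicate submission maps to the same key and is rejected. The sketch below shows that idea with an in-memory stand-in for the shared queue. It is not BullMQ code, and `job_id`, `InMemoryQueue`, and the key format are hypothetical; how the real pipeline derives its keys is not stated here.

```python
import hashlib
from typing import Dict


def job_id(product_id: str, bundle_version: str) -> str:
    """Derive a stable ID: the same inputs always yield the same key."""
    raw = f"{product_id}:{bundle_version}".encode("utf-8")
    return "validate-" + hashlib.sha256(raw).hexdigest()[:16]


class InMemoryQueue:
    """Stand-in for a shared queue; a real one would keep this state in Redis."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}

    def enqueue(self, product_id: str, bundle_version: str) -> bool:
        jid = job_id(product_id, bundle_version)
        if jid in self._jobs:
            return False  # duplicate submission: nothing new is scheduled
        self._jobs[jid] = "waiting"
        return True

    def cancel(self, product_id: str, bundle_version: str) -> bool:
        # Any instance can cancel, because the ID is recomputed from the
        # inputs rather than remembered by the instance that enqueued it.
        return self._jobs.pop(job_id(product_id, bundle_version), None) is not None


queue = InMemoryQueue()
print(queue.enqueue("prod-1", "v3"))  # True: first submission is accepted
print(queue.enqueue("prod-1", "v3"))  # False: duplicate is rejected
print(queue.cancel("prod-1", "v3"))   # True: job found and removed
```

Across a load-balanced fleet the check-and-insert would need to be atomic in the shared store; the dictionary here is only safe within a single process.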

Honest note: Validated in pilot and moving from staging to UAT — not yet production. Pilot accuracy (~70–80% single-plan) is directional and needs broader multi-plan validation; figures are bounded to pilot/staging evidence.
