The problem
Onboarding or updating an insurance product depended on a manual, error-prone handoff: developers interpreted raw insurer rate sheets and benefit tables by hand, wrote test cases, and spot-checked rates against the quoting engine. Data problems surfaced late, in the middle of development, after engineering effort had already been spent. Deterministic extractors existed for only ~3–4 carriers, so every other carrier went through this manual path.
The approach
- Designed a layered, two-phase validation architecture so AI improves coverage and intake speed while deterministic execution against the live engine remains the final authority.
- Phase 1 (developer-stage automation): extracted data from raw insurer files (Poppler, openpyxl), generated Playwright tests, ran them against the real quoting engine, and versioned the test-case CSVs in S3 as the contract between dev and QA (dashboard with RBAC and an audit trail).
- Phase 2 (a shift-left IPM gate): a headless Claude CLI acts as first-pass reviewer, running a 5-checkpoint review over raw insurer files (with image- and checkbox-aware PDF reading). It produces a plain-language report, a four-level root-cause classification, and a generated test-case CSV, followed by a gated 'Dev Ready' handoff into Jira.
- Positioned the LLM strictly as a first-pass reviewer, never the final authority — deterministic Playwright execution stays the source of truth.
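The layering above, where the test-case CSV is the contract and deterministic execution has the last word, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the CSV columns (`case_id`, `plan`, `age`, `expected_premium`), the `stub_quote` function standing in for a Playwright run against the quoting engine, and the `final_verdict` helper. None of them come from the actual system.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable

# Hypothetical test-case CSV; the real contract columns are not given in the write-up.
SAMPLE_CSV = """case_id,plan,age,expected_premium
TC-1,basic,30,120.50
TC-2,basic,45,180.75
"""


@dataclass(frozen=True)
class Verdict:
    case_id: str
    passed: bool
    expected: Decimal
    actual: Decimal


def stub_quote(plan: str, age: int) -> Decimal:
    """Stand-in for the live quoting engine (hypothetical rates)."""
    rates = {("basic", 30): Decimal("120.50"), ("basic", 45): Decimal("180.70")}
    return rates[(plan, age)]


def run_cases(csv_text: str, quote: Callable[[str, int], Decimal]) -> list[Verdict]:
    """Compare each expected premium with the engine's quote, exactly."""
    verdicts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expected = Decimal(row["expected_premium"])
        actual = quote(row["plan"], int(row["age"]))
        # Decimal avoids float rounding, so equality here is an exact comparison.
        verdicts.append(Verdict(row["case_id"], expected == actual, expected, actual))
    return verdicts


def final_verdict(verdicts: list[Verdict], llm_findings: list[str]) -> bool:
    """Decide pass/fail from deterministic results only.

    llm_findings is accepted so it can travel with the result for human
    review, but it is deliberately ignored when computing the decision.
    """
    return bool(verdicts) and all(v.passed for v in verdicts)


if __name__ == "__main__":
    results = run_cases(SAMPLE_CSV, stub_quote)
    for v in results:
        print(v.case_id, "PASS" if v.passed else "FAIL", v.expected, v.actual)
    print("final:", final_verdict(results, ["reviewer saw no issues"]))
```

In this sketch, the stub engine disagrees with the CSV on TC-2, so the run fails even though the LLM reviewer reported no issues. That is the point of the layering: the reviewer widens coverage, and the deterministic comparison decides.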
The outcome
- Removed the per-carrier tooling bottleneck: new carrier documents can be validated on day one, without building bespoke deterministic extractors first.
- A retroactive pilot across 7 production tickets found that up to ~58% of historical dev↔PM clarification cycles could have been caught automatically (baseline: 3.7 queries per ticket).
- Validated bit-exact rate derivation on real insurer data; 0 false positives across 4 already-shipped bundles while catching 6 of 6 seeded defects; model/prompt versions pinned per run for auditability.
- Hardened for horizontal scale: shared BullMQ/Redis queue (exactly-once execution across a load-balanced fleet), per-product idempotency, and surgical cross-instance job cancellation — verified live with zero duplicate or orphaned processes.
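The write-up does not say how per-product idempotency is enforced, so the following is only a sketch of one plausible scheme: claim by product key, where the first job to claim a product wins and later duplicates are rejected. An in-memory dictionary guarded by a lock stands in for the shared Redis store. `JobLedger`, `claim`, `cancel`, and `race` are hypothetical names, and none of this is the BullMQ API. Across real instances the claim would need an atomic operation in the shared store (for example Redis `SET` with the `NX` option) instead of a process-local lock.

```python
import threading
from typing import Optional


class JobLedger:
    """In-memory stand-in for a shared store keyed by product ID (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: dict[str, str] = {}  # product_id -> job_id

    def claim(self, product_id: str, job_id: str) -> bool:
        """Return True only for the first job that claims a product."""
        with self._lock:
            if product_id in self._active:
                return False  # duplicate submission: reject, do not run twice
            self._active[product_id] = job_id
            return True

    def cancel(self, product_id: str) -> Optional[str]:
        """Release the claim for one product only; other products are untouched."""
        with self._lock:
            return self._active.pop(product_id, None)


def race(ledger: JobLedger, product_id: str, workers: int = 8) -> int:
    """Simulate several workers racing to claim one product; return the winner count."""
    wins: list[int] = []

    def attempt(i: int) -> None:
        if ledger.claim(product_id, f"job-{i}"):
            wins.append(i)

    threads = [threading.Thread(target=attempt, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)


if __name__ == "__main__":
    ledger = JobLedger()
    print("winners for prod-A:", race(ledger, "prod-A"))
    print("cancelled job:", ledger.cancel("prod-A"))
```

Cancellation in this sketch is keyed by product, which mirrors the idea of cancelling one job without disturbing others. Signalling a worker that is already running on another instance is a separate concern that the sketch does not cover.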
Honest note: validated in pilot and moving from staging to UAT; not yet in production. Pilot accuracy (~70–80% single-plan) is directional and needs broader multi-plan validation. All figures are bounded to pilot/staging evidence.