LLM-as-a-Judge for Voice Agents: Testing Non-Deterministic AI with Simulated Callers

You cannot unit-test a conversation. The testing playbook for production voice agents: a four-layer test pyramid, simulated callers over real audio, LLM-as-a-judge scoring calibrated to design intent, the transcript-integrity trap, and the 2-of-3 flake rule.

AI Voice Agent Architecture: What I Learned Building the Same Agent Three Times

I built the same production voice agent three times. The orchestrator collapsed under coupling, server-gated turns created dead air, and the third architecture — where the realtime model owns the conversation — is the one that survived. Pros, cons, and diagrams of all three.

Fix one prompt — re-test everything.

Why AI Software Development Breaks Traditional Software Engineering

Fixing an AI bug isn't like fixing a regular bug — every prompt change ripples through the whole system. Why AI software development needs a new kind of regression testing, and why traditional software engineering doesn't apply.
