LLM-as-a-Judge for Voice Agents: Testing Non-Deterministic AI with Simulated Callers

You cannot unit-test a conversation. The testing playbook for production voice agents: a four-layer test pyramid, simulated callers over real audio, LLM-as-a-judge scoring calibrated to design intent, the transcript-integrity trap, and the 2-of-3 flake rule.

Continue ReadingLLM-as-a-Judge for Voice Agents: Testing Non-Deterministic AI with Simulated Callers