The multi-turn failures that prompt evals can't see

Most agent failures we see in pilots don't show up on prompt evals.

Hallucinations are a genuine phenomenon and warrant testing (we wrote about how Voxli detects them in a previous post), but they are a visible failure: the kind that gets screenshotted and posted on Reddit.

Then there are the quieter, more hidden failures: conversations that look fine at every individual turn and go wrong somewhere in between.

Here are a few patterns we keep seeing:

  1. Context build-up

The agent makes a correct decision on turn 1 based on what the customer said. By turn 4 the customer has either clarified, corrected, or changed their mind, which means the same decision now needs different reasoning. The agent reuses its earlier reasoning anyway: fluent, on-brand, confident - and wrong.

  2. Tool parameters from stale context

The agent calls a tool correctly on turn 2. The customer pivots to a related-but-different request a few turns later. The agent calls another tool and silently reuses parameters from earlier in the conversation. This is the failure that looks fine in the transcript and breaks the customer's order in the database; we saw a version of it last week with GPT-5.4-mini. (A sketch of the kind of check that catches this follows the list.)

  3. Lost the thread

The customer mentioned order NS-28479 in turn 2 and asked about the returns policy in turn 4. By turn 6 the agent is answering about a different order, or treating something it said earlier as something the customer said. Every turn looks coherent on its own, but the thread doesn't.
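
To make pattern 2 concrete, here is a minimal sketch of an after-the-fact check over a transcript. The transcript shape, the `get_order_status` tool name, and the `order_id` parameter are all hypothetical (not Voxli's real schema); the point is that the check compares a tool call's parameters against the most recent user turn, not the earliest one.

```python
import re

# Hypothetical transcript shape: a list of turns, where tool calls record
# the parameters the agent actually sent. Illustrative names throughout.
transcript = [
    {"role": "user", "text": "Where is order NS-28479?"},
    {"role": "tool_call", "name": "get_order_status",
     "params": {"order_id": "NS-28479"}},
    {"role": "user", "text": "Actually, I meant order NS-31002."},
    {"role": "tool_call", "name": "get_order_status",
     "params": {"order_id": "NS-28479"}},  # stale parameter: the bug
]

def latest_order_mentioned(turns, before):
    """Most recent order id the customer mentioned before this turn."""
    for turn in reversed(turns[:before]):
        if turn["role"] == "user":
            match = re.search(r"\bNS-\d+\b", turn["text"])
            if match:
                return match.group()
    return None

def check_no_stale_order_id(turns):
    for i, turn in enumerate(turns):
        if turn["role"] == "tool_call" and "order_id" in turn["params"]:
            expected = latest_order_mentioned(turns, i)
            actual = turn["params"]["order_id"]
            assert expected is None or actual == expected, (
                f"turn {i}: tool called with {actual}, but the customer "
                f"last mentioned {expected}"
            )

check_no_stale_order_id(transcript)  # fails on the final tool call above
```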

None of these patterns shows up on a prompt eval, and the reason is structural.

A prompt eval has one input and checks one output. A multi-turn failure depends on the path the conversation takes, and the path is produced by the agent's own behavior. You can't pre-script it, and if you did, you'd only ever test the paths you already thought of.

This is why multi-turn testing has to be simulated: something has to play the user dynamically, responding to whatever the agent does.
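
In skeleton form, the loop looks like the sketch below. The `agent` and `simulated_user` interfaces are hypothetical stand-ins for LLM-backed objects; the structure is what matters: each user message is generated from the agent's last reply, so the path is discovered at run time rather than written in advance.

```python
def simulate(agent, simulated_user, max_turns=10):
    """Drive a conversation with an LLM playing the user.

    `agent` is the system under test; `simulated_user` is prompted with a
    goal (and optionally a personality). Both interfaces are hypothetical.
    """
    transcript = []
    user_msg = simulated_user.respond(transcript)  # opening message
    for _ in range(max_turns):
        transcript.append({"role": "user", "text": user_msg})
        agent_msg = agent.respond(transcript)
        transcript.append({"role": "agent", "text": agent_msg})
        if simulated_user.is_done(transcript):
            break
        # The next user turn depends on what the agent just said --
        # the part a pre-scripted eval cannot reproduce.
        user_msg = simulated_user.respond(transcript)
    return transcript
```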

That's what Voxli does. You write a single instruction describing what the user is trying to do, e.g. check a shipment status, and an AI tester drives the conversation.

You can apply a personality to change how that user communicates: frustrated, one-word answers, language-mixing. Different personalities often take a different path through the conversation, surfacing issues that would otherwise go unseen. Finally, assertions read the full transcript after the conversation ends, including every tool call and the parameters it was given.
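
Put together, a test boils down to three ingredients. The shape below is illustrative only, not Voxli's actual API: a goal for the simulated user, a personality, and transcript-level assertions that can see every tool call.

```python
# Illustrative sketch, not Voxli's real configuration format.
test = {
    "instruction": "Check the status of a recent shipment.",
    "personality": "frustrated, one-word answers",
    "assertions": [
        # Evaluated against the full transcript after the run ends,
        # including tool calls and their parameters.
        "every tool call uses the order id the customer most recently gave",
        "the agent never attributes its own statements to the customer",
    ],
}
```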

Prompt evals stay useful for what they're for; multi-turn simulations catch what they can't.

So we'll say it again: agents must be tested.


Test for multi-turn scenarios with Voxli and never miss a failure.

Start your demo