The multi-turn failures that prompt evals can't see
Most agent failures we see in pilots don't show up on prompt evals. Hallucinations are real and worth testing for (we wrote about how Voxli detects them in a previous post), but they are a visible failure that is more often than not found screenshotted