Voxli

Latest — 27 Apr 2026

The multi-turn failures that prompt evals can't see

Most agent failures we see in pilots don't show up on prompt evals. Hallucinations are a genuine phenomenon and warrant testing (we wrote about how Voxli detects them in a previous post). However, they are a visible failure, one that is more often than not found screenshotted

More issues

The 10-minute test that stops your agent from canceling real orders

We discovered last week that a failed tool call would cause GPT-5.4-mini to cancel a real order simply because a customer asked a question involving "order cancellation." Following our mantra of "always test," we're offering a quick 10-minute Voxli test today that will help rectify

Expertise.ai teams up with Voxli to solve the "absolute insanity" of their AI sales agent testing workflow

Expertise.ai is a known disruptor in the AI space, building AI sales agents that deliver contextual, real-time conversations that guide prospects through personalized sales flows rather than static decision trees. But as Expertise.ai was pushing the limits to improve the response quality of their AI sales agents, they

The Failed Tool Call When Simulating a Customer Conversation Across Three LLMs

Recently, to assess AI agent performance with tool calls, we executed the same multi-turn conversation across the three tiers of OpenAI's GPT-5.4: standard, mini, and nano. Our findings should make any seasoned AI developer nervous, especially if you're making changes to your agent without proper

Testing for Speculation Using Voxli

In our last post, we discussed the Risks of Agent Speculation. Today we will look at how you can set up Voxli to catch speculation, using a feature called hallucination detection. Activating hallucination detection prompts Voxli to review agent dialogue and tool selections. Voxli then extracts the claims that the

The Risks of Agent Speculation

It’s no surprise that hallucinations are a common, well-known failure in agentic AI testing. The agent starts to overpromise, fabricates answers, and even claims to have taken action, stating it has ‘escalated to support’ when it has not. All agent builders know to