Expertise.ai teams up with Voxli to solve the "absolute insanity" of their AI sales Agent testing workflow

Expertise.ai is a known disruptor in the AI space, building AI sales agents that hold contextual, real-time conversations and guide prospects through personalized sales flows rather than static decision trees.

But as Expertise.ai pushed to improve the response quality of its AI sales agents, the team found that testing agents for different customers - each with its own vector database and advanced prompts - came with many complications, especially for multi-turn conversations.

Here is how a timely message from the founders of Voxli.io helped them out.

Expertise’s challenges with their agents and testing solutions

The team at Expertise found that in certain circumstances, their AI sales agents would skip questions or get confused, creating a potential brand risk for themselves and for customers. 

Ammar Khan, a builder at Expertise, shares his experiences when trying to mitigate the problem: “Engineers were putting in the same question again and again, hoping the issue would show up. You'd ship a fix, then run it 10 times hoping it doesn't come back. It was absolute insanity…Some bugs only reproduced under specific customer bot configurations.”

Ammar highlighted a specific demo where the agent failed to identify the current city mayor, as it could not distinguish between the current and former mayors within the RAG context. 

"Engineers were putting in the same question again and again, hoping the issue would show up. You'd ship a fix, then run it 10 times hoping it doesn't come back. It was absolute insanity." — Ammar Khan, Expertise.ai

After spending hours without finding an off-the-shelf fix, Ammar built a custom workaround: a script that duplicated customer agents and ran each test 30 times for statistical significance.
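The core idea behind that script - run the same question many times and look at the failure *rate*, since flaky agent bugs may surface only once in many runs - can be sketched as follows. This is a hypothetical illustration, not Expertise's actual code: `repeat_test`, `FlakyAgent`, and the `query_agent` callable are all invented names standing in for a duplicated customer agent.

```python
def repeat_test(query_agent, question, expected_substring, runs=30):
    """Ask the same question `runs` times; return the fraction of failed runs.

    A single pass/fail tells you little about an intermittent bug; the
    failure rate across repeated runs is what makes a fix verifiable.
    """
    failures = 0
    for _ in range(runs):
        answer = query_agent(question)
        if expected_substring.lower() not in answer.lower():
            failures += 1
    return failures / runs


class FlakyAgent:
    """Deterministic stand-in for illustration: answers wrongly on every 7th call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, question):
        self.calls += 1
        if self.calls % 7 == 0:
            return "I'm not sure."
        return "The current mayor is Jane Doe."


rate = repeat_test(FlakyAgent(), "Who is the mayor?", "Jane Doe")
print(f"Failure rate over 30 runs: {rate:.0%}")  # 13%: 4 of 30 runs fail
```

A fix that truly lands should drive this rate to zero across all 30 runs, not just pass once.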

The custom approach helped, but had clear limitations:

  • The unoptimized testing infrastructure could take an hour to run
  • Setting up a single new test took 10-30 minutes
  • Engineers shied away from writing tests because of complexity
  • Despite seeing its value, the engineering team was reluctant to adopt the testing strategy

How Voxli helped Expertise.ai automate regression tests

That's when Voxli reached out. After understanding Expertise's setup, the team integrated Voxli with their agents in a quick onboarding.

According to Ryan Hoffman, an engineer at Expertise, when a customer reports incorrect responses, the development team connects to Voxli via an MCP interface and works directly in their development environment. Through that MCP connection inside the IDE, Ryan uses Cursor to run Voxli tests iteratively: change the code, rerun the tests, and check for regressions and other issues.

This allowed them to quickly diagnose the problem: they created a baseline of 10 repeated test runs built from customer-provided examples, confirmed the issue, and pinned down the cause of failures - such as the system retrieving incorrect information from the vector database (RAG) or unexpected behavior from the base prompt.

According to Ryan, the new Voxli environment enabled rapid resolution of incorrect customer responses, thanks to its smooth integration with the coding environments he tested himself - VS Code, Cursor, and Codex - with Claude Code also supported.

“The feedback loop is super fast,” he shares. “I can create tests, see what went wrong, tweak them - that would have taken hours before. Now it just kind of goes.”

Expertise’s upward trajectory in efficiency

Now with Voxli, a full suite of tests runs in parallel up to 10 times faster, taking minutes instead of hours - down from the prior hour-long run time for ~20 tests. In addition, Ammar reports a "magnitude improvement" in both the time required to run tests and the time required to set them up.

Ryan's accelerated pace caught the attention of Expertise's CEO, who tasked him with leading the company's agent quality initiative, a role that emerged because Voxli’s tooling made rapid improvement possible.

Where previously tests would have taken hours to complete, the team now runs the test suites nightly at midnight, checking for regressions whenever a change has been pushed.

"The feedback loop is super fast. I can create tests, see what went wrong, tweak them - that would have taken hours before. Now it just kind of goes." — Ryan Hoffman, Engineer, Expertise.ai

And the mayor bug from earlier? Voxli caught it, the team patched it, and a regression test now prevents it from coming back.
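The shape of that regression check is worth spelling out. A hedged sketch of what such a test could look like - the mayor names are invented for illustration, and `check_mayor_answer` is an assumed helper, not Voxli's actual API:

```python
def check_mayor_answer(answer, current="Jane Doe", former="John Smith"):
    """Pass only if the answer names the current mayor and never the former one.

    This is a deliberately strict check: because the original bug was the
    agent confusing current and former mayors in the RAG context, any
    mention of the former mayor is treated as a failure.
    """
    return current in answer and former not in answer


# A correct answer passes; an answer showing the old confusion fails.
assert check_mayor_answer("The current mayor is Jane Doe.")
assert not check_mayor_answer("The mayor is John Smith.")
```

Run nightly alongside the rest of the suite, a check like this keeps the original confusion from silently returning after a prompt or retrieval change.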

The future of regression testing for Expertise.ai

Ammar affirms Voxli’s long-term value, stating that “Voxli becomes more valuable the longer it is used because of the accumulation of comprehensive regression tests that act as a proprietary source of advantage.” Similarly, Ryan uses Voxli as an “issue detector” to diagnose problems during iteration loops, letting him tweak prompts quickly, and says setting up tests is now “simple and fun.”

Looking ahead, the team feels ready for its next responsibility: redesigning the chat structure for Expertise. If Voxli's tests pass, they can confidently deploy the new structure and continue improving their AI agent.

"Voxli becomes more valuable the longer you've used it. The tests you build become a proprietary source of advantage — a custom dataset your competitors don't have." — Ammar Khan, Expertise.ai

Try out Voxli today