Surprisingly enough, it seems that some AI agents are not quite up to scratch on some basic business tasks

Salesforce research finds single-turn tasks see only 58% success, while multi-turn performance drops to 35%
Reasoning models such as Gemini-2.5-Pro tend to outperform lighter models
CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark, CRMArena-Pro, which uses synthetic company data to assess LLM agent performance in different CRM scenarios.

It found that LLM agents achieved about 58% success on tasks that can be completed in a single step, while on tasks that require multiple interactions performance fell to only 35%, or just over one in three.

Although models such as Gemini-2.5-Pro achieved over 83% success in workflow execution, the Salesforce researchers still highlighted some concerns with AI agents, suggesting that they may not be up to scratch.

Are AI agents actually that good?

The paper, entitled ‘Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions’, explained that LLM agents showed almost zero inherent confidentiality awareness, noting that their performance in handling sensitive information only improved with explicit prompting (which often came at the expense of task success).

They also criticized earlier and existing benchmarks for failing to capture multi-turn interactions, to address B2B scenarios or confidentiality, and to reflect realistic data environments. CRMArena-Pro is built on synthetic data validated by CRM experts, covering both B2B and B2C settings.

In terms of analysis results, reasoning models such as Gemini-2.5-Pro and o1 outperformed lighter models by the widest margin. The Salesforce researchers concluded that models that seek more clarification generally perform better, especially in multi-turn tasks.

For example, while the average performance across the nine tested models (three each from OpenAI, Google and Meta) resulted in a score of 35.1%, Gemini-2.5-Pro scored 54.5%.

“These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world scenarios, positioning CRMArena-Pro as a challenging testbed for guiding future progress in the development of more sophisticated, reliable and confidentiality-aware LLM agents for professional use,” the researchers concluded.

Looking ahead, Salesforce CEO Marc Benioff sees AI agents as a high-margin opportunity, with large corporate customers, including governments, turning to AI agents for increased efficiency and additional cost savings.
