CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
- URL: http://arxiv.org/abs/2505.18878v1
- Date: Sat, 24 May 2025 21:33:22 GMT
- Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
- Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu
- Abstract summary: CRMArena-Pro is a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. It incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings.
- Score: 85.88573535033406
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.
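The abstract reports aggregate success rates (around 58% single-turn, roughly 35% multi-turn) without showing the evaluation harness. Below is a minimal sketch of how such per-mode success rates could be computed; the `Task` dataclass, `check` predicate, and `run_agent` callable are hypothetical stand-ins, not CRMArena-Pro's actual interfaces.

```python
# Hypothetical sketch of computing single-turn vs. multi-turn success rates
# for an LLM-agent benchmark. These interfaces do NOT come from the
# CRMArena-Pro paper; they are illustrative assumptions only.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: str
    skill: str                        # e.g. "Workflow Execution"
    multi_turn: bool                  # persona-guided dialogue vs. one-shot query
    check: Callable[[str], bool]      # validates the agent's final output


def success_rate(tasks: List[Task], run_agent: Callable[[Task], str]) -> float:
    """Fraction of tasks where the agent's final output passes the check."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.check(run_agent(t)))
    return passed / len(tasks)


def report(tasks: List[Task], run_agent: Callable[[Task], str]) -> None:
    """Print success rates split by interaction mode."""
    single = [t for t in tasks if not t.multi_turn]
    multi = [t for t in tasks if t.multi_turn]
    print(f"single-turn success: {success_rate(single, run_agent):.1%}")
    print(f"multi-turn success:  {success_rate(multi, run_agent):.1%}")
```

A full harness would also break results down by skill (the paper singles out Workflow Execution, at over 83% single-turn success for top agents) and score the confidentiality-awareness probes separately.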
Related papers
- Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs. Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience, and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z) - ProRefine: Inference-Time Prompt Refinement with Textual Feedback [8.261243439474322]
Agentic workflows, in which multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications. We introduce ProRefine, an innovative inference-time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback (a sketch of such a loop appears after this list). ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points.
arXiv Detail & Related papers (2025-06-05T17:52:30Z) - OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation [23.204532296472834]
We introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation.
arXiv Detail & Related papers (2025-04-18T14:11:27Z) - CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and in less than 55% even with function-calling abilities. Our findings highlight the need for enhanced function-calling and rule-following capabilities before agents can be deployed in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z) - Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z) - MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z) - AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also brings the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)