$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
- URL: http://arxiv.org/abs/2505.18746v4
- Date: Fri, 27 Jun 2025 03:58:25 GMT
- Title: $C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
- Authors: Peijie Yu, Yifan Yang, Jinjian Li, Zelong Zhang, Haorui Wang, Xiao Feng, Feng Zhang,
- Abstract summary: We present an open-source benchmark $C^3$-Bench to assess agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance.
- Score: 12.218102495632937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback, and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues, but it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source, high-quality benchmark, $C^3$-Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms, and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking, and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long-context information dependencies, and frequent policy-type switching. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/TencentHunyuan/C3-Benchmark.
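The first challenge, navigating complex tool relationships, lends itself to a concrete illustration. Below is a minimal, hypothetical sketch of a dependency-aware metric over an agent's tool-call trajectory; all names here (ToolCall, score_trajectory, the example tools) are invented for illustration and are not the benchmark's actual API.

```python
# A hypothetical sketch of the kind of check C^3-Bench describes:
# a test case where one tool depends on another, so a robust agent
# must resolve the dependency before acting.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def score_trajectory(calls: list[ToolCall], dependencies: dict[str, str]) -> float:
    """Fraction of dependent tool calls whose prerequisite was invoked earlier.

    `dependencies` maps a tool to the tool that must precede it,
    e.g. {"book_flight": "search_flights"}.
    """
    seen: set[str] = set()
    ok = scored = 0
    for call in calls:
        prereq = dependencies.get(call.name)
        if prereq is not None:
            scored += 1
            if prereq in seen:
                ok += 1
        seen.add(call.name)
    return ok / scored if scored else 1.0


# Example: the agent books a flight without searching first -> score 0.0
trajectory = [ToolCall("book_flight", {"dest": "SFO"})]
print(score_trajectory(trajectory, {"book_flight": "search_flights"}))
```

A real harness would presumably also score the hidden-information and dynamic-decision-path challenges, but the same pattern applies: compare the agent's trajectory against an annotated ground-truth structure.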
Related papers
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It achieves superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories [18.129031749321058]
Large Language Model (LLM)-based agents are increasingly employed to automate complex software engineering tasks. Despite their widespread adoption, the internal decision-making processes of these agents remain largely unexplored. We present a large-scale empirical study of the thought-action-result trajectories of three state-of-the-art LLM-based agents.
arXiv Detail & Related papers (2025-06-23T16:34:52Z) - Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z) - O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering [31.38063794496179]
O$^2$-Searcher is a novel search agent leveraging reinforcement learning to tackle both open-ended and closed-ended questions in the open domain. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adopt different answer-generation strategies. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA.
arXiv Detail & Related papers (2025-05-22T12:17:13Z) - MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Multi-Turn Reasoning evaluation of Large Language Models. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities. MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation.
arXiv Detail & Related papers (2025-05-21T17:59:12Z) - Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions [12.218102495632937]
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. We propose the Multi-Mission Tool Bench, in which each test case comprises multiple interrelated missions. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees.
arXiv Detail & Related papers (2025-04-03T14:21:33Z) - Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - Multi-Agent Actor-Critic Generative AI for Query Resolution and Analysis [1.0124625066746598]
We introduce MASQRAD, a transformative framework for query resolution based on the actor-critic model. MASQRAD excels at translating imprecise or ambiguous user inquiries into precise and actionable requests. MASQRAD functions as a sophisticated multi-agent system but "masquerades" to users as a single AI entity.
arXiv Detail & Related papers (2025-02-17T04:03:15Z) - SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
Smartphone agents are increasingly important for helping users control devices efficiently. We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
arXiv Detail & Related papers (2024-10-19T17:28:48Z) - Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z) - Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions [68.92637077909693]
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment.
A general setting is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content.
Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions.
arXiv Detail & Related papers (2024-08-05T15:16:22Z) - HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments [93.94020724735199]
HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind.
This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines.
arXiv Detail & Related papers (2024-01-23T18:59:43Z) - Towards Robust Multi-Modal Reasoning via Model Selection [7.6621866737827045]
The LLM serves as the "brain" of the agent, orchestrating multiple tools for collaborative multi-step task solving.
We propose the $\textit{M}^3$ framework as a plug-in with negligible runtime overhead at test time.
Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies.
arXiv Detail & Related papers (2023-10-12T16:06:18Z)