Related papers: AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

URL: http://arxiv.org/abs/2601.20730v2
Date: Thu, 29 Jan 2026 12:32:51 GMT
Title: AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts
Authors: Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu
Abstract summary: We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
Score: 78.33143446024485
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.

Related papers

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z)
AMA: Adaptive Memory via Multi-Agent Collaboration [54.490349689939166]
We propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods.
arXiv Detail & Related papers (2026-01-28T08:09:49Z)
RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis [29.39426376890088]
Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. We introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories.
arXiv Detail & Related papers (2026-01-13T16:25:07Z)
Evaluating Long-Context Reasoning in LLM-Based WebAgents [22.264781808930948]
This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents. We observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives.
arXiv Detail & Related papers (2025-12-03T22:53:10Z)
ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction [84.90394416593624]
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions. Existing simulation-based data generation methods rely heavily on costly autoregressive interactions between multiple agents. We propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues.
arXiv Detail & Related papers (2025-08-18T07:38:23Z)
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.18498389833308]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z)
$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking [12.218102495632937]
We present an open-source benchmark, $C^3$-Bench, to assess agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance.
arXiv Detail & Related papers (2025-05-24T15:25:44Z)
NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent). It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.