Evaluating Long-Context Reasoning in LLM-Based WebAgents
- URL: http://arxiv.org/abs/2512.04307v1
- Date: Wed, 03 Dec 2025 22:53:10 GMT
- Title: Evaluating Long-Context Reasoning in LLM-Based WebAgents
- Authors: Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai
- Abstract summary: This paper introduces a benchmark for evaluating long-context reasoning capabilities of WebAgents. We observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long-context scenarios. Our detailed error analysis reveals that agents primarily fail because they get stuck in loops and lose track of the original task objectives.
- Score: 22.264781808930948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
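The injection protocol described in the abstract can be sketched as follows: irrelevant task trajectories are inserted between two dependent subtasks until the interaction history reaches a target token budget. This is a minimal illustration only; the class and function names (`Trajectory`, `build_long_context`) and the whitespace token proxy are assumptions for the sketch, not the paper's actual implementation, which presumably uses a real tokenizer and full WebAgent trajectories.

```python
# Sketch of padding an agent's history with distractor trajectories so
# that a later subtask must retrieve information from far earlier context.
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    steps: list[str]          # serialized (observation, action) pairs

    def text(self) -> str:
        return "\n".join(self.steps)

def approx_tokens(text: str) -> int:
    # Crude whitespace proxy; a real setup would use the model's tokenizer.
    return len(text.split())

def build_long_context(subtask_a: Trajectory,
                       subtask_b_prompt: str,
                       distractors: list[Trajectory],
                       target_tokens: int) -> str:
    """Append distractor trajectories after subtask A until the history
    reaches roughly `target_tokens`, then append subtask B, which depends
    on information produced in subtask A."""
    parts = [subtask_a.text()]
    for d in distractors:
        if approx_tokens("\n\n".join(parts)) >= target_tokens:
            break
        parts.append(d.text())
    parts.append(subtask_b_prompt)
    return "\n\n".join(parts)
```

Varying `target_tokens` (e.g. from 25,000 to 150,000) reproduces the paper's sweep over context lengths while holding the dependent subtask pair fixed.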
Related papers
- Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO [19.784541601653128]
Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits. We propose a novel long-horizon framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization.
arXiv Detail & Related papers (2026-02-09T11:32:02Z) - IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning [54.21689544323704]
Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge. Unlike real-time conversational assistants, DR is computationally expensive and time-consuming. We propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research.
arXiv Detail & Related papers (2026-02-03T12:43:09Z) - ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support [11.480342895892404]
Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory. ES-MemEval is a benchmark that systematically evaluates five core memory capabilities. EvoEmo is a dataset for personalized long-term emotional support.
arXiv Detail & Related papers (2026-02-02T09:58:26Z) - AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z) - Long-term Task-oriented Agent: Proactive Long-term Intent Maintenance in Dynamic Environments [8.937298475124484]
Current large language model agents operate under a reactive paradigm, responding only to immediate user queries within short-term sessions. We propose a novel interaction paradigm for proactive task-oriented agents capable of bridging the gap between a user's relatively static needs and a dynamic environment. We introduce a high-quality data synthesis pipeline to construct complex, multi-turn dialog data in a dynamic environment.
arXiv Detail & Related papers (2026-01-14T11:15:40Z) - LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - AgentFold: Long-Horizon Web Agents with Proactive Context Management [98.54523771369018]
LLM-based web agents show immense promise for information seeking, yet their effectiveness is hindered by a fundamental trade-off in context management. We introduce AgentFold, a novel agent paradigm centered on proactive context management. With simple supervised fine-tuning, our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH.
arXiv Detail & Related papers (2025-10-28T17:51:50Z) - COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context [17.575806280348797]
Small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We propose a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components.
arXiv Detail & Related papers (2025-10-09T20:14:26Z) - BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions [33.59162905707337]
Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings.
arXiv Detail & Related papers (2025-10-06T19:31:47Z) - NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z) - Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent). It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z) - Evaluating Very Long-Term Conversational Memory of LLM Agents [95.84027826745609]
We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
arXiv Detail & Related papers (2024-02-27T18:42:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.