Related papers: The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management

URL: http://arxiv.org/abs/2508.21433v3
Date: Mon, 27 Oct 2025 15:08:54 GMT
Title: The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management
Authors: Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, Yaroslav Zharov
Abstract summary: Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization.
Score: 2.582081036460148
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified across five diverse model configurations. Moreover, we show initial evidence of our findings generalizing to the OpenHands agent scaffold. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. Additionally, we introduce a novel hybrid approach that further reduces costs by 7% and 11% compared to just observation masking or LLM summarization, respectively. Our findings raise concerns regarding the trend towards pure LLM summarization and provide initial evidence of untapped cost reductions by pushing the efficiency-effectiveness frontier. We release code and data for reproducibility.
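
The observation masking idea described in the abstract (omitting older environment observations while keeping the agent's reasoning and actions) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` type, the `mask_old_observations` function, the `keep_last` parameter, and the placeholder text are all hypothetical names chosen for illustration, and the window size used in the paper is not assumed here.

```python
from dataclasses import dataclass, replace
from typing import List

# Hypothetical placeholder text; the paper's actual wording is not assumed.
PLACEHOLDER = "[observation omitted]"


@dataclass(frozen=True)
class Turn:
    role: str      # e.g. "system", "assistant" (reasoning/action), "observation"
    content: str


def mask_old_observations(history: List[Turn], keep_last: int = 2) -> List[Turn]:
    """Return a copy of `history` in which every observation except the
    most recent `keep_last` ones is replaced by a short placeholder.
    Non-observation turns (reasoning, actions) are left untouched."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    obs_indices = [i for i, t in enumerate(history) if t.role == "observation"]
    # A slice ending at -0 would be empty, so handle keep_last == 0 explicitly.
    to_mask = set(obs_indices if keep_last == 0 else obs_indices[:-keep_last])
    return [
        replace(t, content=PLACEHOLDER) if i in to_mask else t
        for i, t in enumerate(history)
    ]


if __name__ == "__main__":
    history = [
        Turn("system", "You are a software engineering agent."),
        Turn("assistant", "ls"),
        Turn("observation", "README.md src tests"),
        Turn("assistant", "cat src/main.py"),
        Turn("observation", "def main(): ..."),
        Turn("assistant", "pytest"),
        Turn("observation", "3 passed"),
    ]
    for turn in mask_old_observations(history, keep_last=1):
        print(f"{turn.role}: {turn.content}")
```

Unlike LLM summarization, this kind of masking needs no extra model call: the context shrinks through a simple list transformation applied before each step.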

Related papers

RelayLLM: Efficient Reasoning via Collaborative Decoding [23.351598429979024]
RelayLLM is a novel framework for efficient reasoning via token-level collaborative decoding. We show that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models.
arXiv Detail & Related papers (2026-01-08T17:56:16Z)
URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that Unifies Retrieval and Generation within a single MLLM. We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z)
LimRank: Less is More for Reasoning-Intensive Information Reranking [58.32304478331711]
Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision.
arXiv Detail & Related papers (2025-10-27T17:19:37Z)
Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning [21.75018489673356]
Chain-of-thought prompting and deep reasoning substantially enhance performance on complex tasks. Applying deep reasoning to all problems is computationally expensive. We propose a complementary agent system integrating small and large Large Language Models.
arXiv Detail & Related papers (2025-10-15T06:59:07Z)
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents [60.881609323604685]
AgentSynth is a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets. Our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations.
arXiv Detail & Related papers (2025-06-17T05:46:52Z)
Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers [74.17516978246152]
Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. We propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds. Experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines.
arXiv Detail & Related papers (2025-05-26T15:27:55Z)
Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments [55.044159987218436]
Large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. We take a first step toward exploring the early-exit behavior for LLM-based agents.
arXiv Detail & Related papers (2025-05-23T08:23:36Z)
ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently [3.6393221632527686]
Large language models (LLMs) solve complex tasks by generating intermediate reasoning steps prior to providing answers. The widely-used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy. We propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved.
arXiv Detail & Related papers (2025-03-22T00:07:28Z)
Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [65.23593936798662]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation. This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG). We compute a factuality score that can be thresholded to yield a binary decision. Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling [38.7578639980701]
Self-improvement methods enable large language models to generate solutions themselves. We find that models tend to over-sample on easy queries and under-sample on queries they have yet to master. We introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging heavy-tailed data.
arXiv Detail & Related papers (2024-11-01T17:18:45Z)
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
More Agents Is All You Need [16.372072265248192]
We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated.
arXiv Detail & Related papers (2024-02-03T05:55:24Z)
Revisiting Large Language Models as Zero-shot Relation Extractors [8.953462875381888]
Relation extraction (RE) consistently involves a certain degree of labeled or unlabeled data, even in the zero-shot setting. Recent studies have shown that large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt. This work explores LLMs as zero-shot relation extractors.
arXiv Detail & Related papers (2023-10-08T06:17:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.