What Limits Agentic Systems Efficiency?
- URL: http://arxiv.org/abs/2510.16276v1
- Date: Sat, 18 Oct 2025 00:21:45 GMT
- Title: What Limits Agentic Systems Efficiency?
- Authors: Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman
- Abstract summary: Existing research predominantly focuses on reasoning performance, often neglecting the efficiency of agentic systems.
We decompose end-to-end latency into two primary components: LLM API latency and web environment latency.
We propose SpecCache, a caching framework augmented with speculative execution that can reduce web environment overhead.
- Score: 6.355808944609144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated strong reasoning capabilities. To further enhance LLM capabilities, recent agentic systems, such as Deep Research, incorporate web interactions into LLM reasoning to mitigate uncertainties and reduce potential errors. However, existing research predominantly focuses on reasoning performance, often neglecting the efficiency of agentic systems. In this work, we present a comprehensive empirical study that identifies efficiency bottlenecks in web-interactive agentic systems. We decompose end-to-end latency into two primary components: LLM API latency and web environment latency. Our study spans 15 models and 5 providers and demonstrates high variability in API-based agentic systems. We observe that web environment latency can contribute as much as 53.7% to the overall latency in a web-based agentic system. To reduce latency, we propose SpecCache, a caching framework augmented with speculative execution that reduces web environment overhead. Extensive evaluations on two standard benchmarks show that our approach improves the cache hit rate by up to 58x compared to a random caching strategy, while reducing web environment overhead by up to 3.2x, without degrading agentic system performance.
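The abstract describes SpecCache only at the level of its core idea: overlap web environment latency with the target model's API latency by letting a cheap draft model speculate on the agent's next action and pre-execute it. Below is a minimal sketch of that idea; `draft_model`, `web_env`, and their methods are illustrative assumptions, not the paper's actual interface.

```python
import concurrent.futures

class SpecCache:
    """Sketch of speculative caching for a web agent (illustrative names)."""

    def __init__(self, draft_model, web_env, num_speculations=3):
        self.draft_model = draft_model          # fast model that guesses next actions
        self.web_env = web_env                  # slow web environment (e.g. a browser)
        self.num_speculations = num_speculations
        self.cache = {}                         # action -> Future[observation]; eviction omitted
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def speculate(self, state):
        """Pre-execute the draft model's top-k guesses in the background."""
        for action in self.draft_model.predict_actions(state, k=self.num_speculations):
            if action not in self.cache:
                self.cache[action] = self.pool.submit(self.web_env.execute, action)

    def lookup_or_execute(self, action):
        """Serve the target model's chosen action, from cache when speculated."""
        future = self.cache.pop(action, None)
        if future is not None:
            return future.result()              # hit: web latency already overlapped
        return self.web_env.execute(action)     # miss: pay the full web round-trip
```

In an agent loop, `speculate(state)` would be called right after the target model's API request is issued, so web prefetching overlaps the API latency instead of adding to it.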
Related papers
- AWE: Adaptive Agents for Dynamic Web Penetration Testing [0.0]
AWE is a memory-augmented multi-agent framework for autonomous web penetration testing.
It embeds structured, vulnerability-specific analysis pipelines within a lightweight LLM orchestration layer.
AWE achieves substantial gains on injection-class vulnerabilities.
arXiv Detail & Related papers (2026-03-01T07:32:42Z)
- Agentic Test-Time Scaling for WebAgents [65.5178428849495]
We present Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious.
CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling.
arXiv Detail & Related papers (2026-02-12T18:58:30Z)
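The CATTS summary above names the mechanism (vote-derived uncertainty gating extra compute) without further detail. The sketch below illustrates one plausible reading; `model.sample_action` and the thresholds are hypothetical.

```python
from collections import Counter

def catts_style_decide(model, state, k_cheap=3, k_extra=8, agree=2 / 3):
    """Vote-derived uncertainty gating, sketched with an assumed interface.

    Draw a few cheap votes first; if they already agree, stop. Only a
    genuinely contentious vote triggers the extra (expensive) samples.
    Actions must be hashable (e.g. strings) for the majority count.
    """
    votes = [model.sample_action(state) for _ in range(k_cheap)]
    action, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= agree:
        return action                               # confident: no extra compute
    votes += [model.sample_action(state) for _ in range(k_extra)]
    return Counter(votes).most_common(1)[0][0]      # contentious: scaled-up majority
```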
- DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents [31.08047797205678]
Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm.
Despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental limitation: the serial execution of multi-round reasoning, tool calling, and tool response waiting under the ReAct agent paradigm.
In this paper, we propose an optimization framework for dLLM-based Search Agents.
arXiv Detail & Related papers (2026-02-03T09:12:08Z)
- An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems [1.5216960763930782]
The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection.
This paper presents a comparative study on the effectiveness of LLM-based techniques for detecting software vulnerabilities.
arXiv Detail & Related papers (2026-01-01T08:05:51Z)
- Towards Efficient Agents: A Co-Design of Inference Architecture and System [66.59916327634639]
This paper presents AgentInfer, a unified framework for end-to-end agent acceleration.
We decompose the problem into four synergistic components: AgentCollab, AgentSched, AgentSAM, and AgentCompress.
Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that, through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%.
arXiv Detail & Related papers (2025-12-20T12:06:13Z)
- Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training.
We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z)
- WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking [60.35109192765302]
Information seeking (IS) is a core capability that enables autonomous reasoning and decision-making.
We propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories.
Our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
arXiv Detail & Related papers (2025-10-28T17:51:42Z)
- Speculative Actions: A Lossless Framework for Faster Agentic Systems [6.708126506152481]
Execution of AI agents is often slow, hampering training, evaluation, and deployment.
Inspired by speculative execution in microprocessors, we propose a framework that predicts likely actions using faster models.
We evaluate this framework across three agentic environments (gaming, e-commerce, and web search), plus a "lossy" extension for an operating systems environment.
arXiv Detail & Related papers (2025-10-05T21:28:11Z)
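The Speculative Actions blurb above specifies the contract (faster models predict likely actions; behavior stays lossless) but not the mechanics. The sketch below shows one way the hit/miss logic could work; all names are illustrative, and `env.simulate` is an assumed side-effect-free preview. A real environment with side effects would need snapshot/rollback instead.

```python
import concurrent.futures

def step_with_speculation(fast_model, slow_model, env, state):
    """One lossless speculative step (illustrative names).

    The slow model's action is always the one committed, so agent behavior
    is unchanged; a correct guess by the fast model merely hides latency.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        slow_future = pool.submit(slow_model.choose_action, state)
        guess = fast_model.choose_action(state)
        # Assumed side-effect-free preview of the guessed action.
        preview = pool.submit(env.simulate, state, guess)
        action = slow_future.result()
        if action == guess:
            return action, preview.result()     # speculation hit: result is ready
        return action, env.execute(action)      # miss: execute the real action normally
```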
- WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [78.55946306325914]
We identify key reasoning skills essential for effective web agents.
We reconstruct the agent's reasoning algorithms into chain-of-thought rationales.
Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z)
- Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs [48.653022530291494]
Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks.
This work presents the first systematic study of the latency-quality trade-off in real-time decision-making tasks.
We propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands.
arXiv Detail & Related papers (2025-05-26T04:03:48Z)
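The FPX entry above describes selecting model size and quantization from real-time demands. A minimal sketch of such a budget-driven selector follows; the configuration table (models, latencies, quality scores) is invented for illustration, not FPX's actual measurements.

```python
CONFIGS = [
    # (name, quantization, est_latency_ms, est_quality) -- made-up numbers
    ("llm-70b", "fp16", 1800, 0.92),
    ("llm-70b", "int4",  900, 0.89),
    ("llm-8b",  "fp16",  300, 0.81),
    ("llm-8b",  "int4",  150, 0.77),
]

def select_config(latency_budget_ms):
    """Pick the highest-quality configuration that fits the latency budget."""
    feasible = [c for c in CONFIGS if c[2] <= latency_budget_ms]
    if not feasible:
        return min(CONFIGS, key=lambda c: c[2])   # nothing fits: fall back to fastest
    return max(feasible, key=lambda c: c[3])      # best quality within budget

print(select_config(1000))   # -> ('llm-70b', 'int4', 900, 0.89)
```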
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.
We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.
We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- DeLag: Using Multi-Objective Optimization to Enhance the Detection of Latency Degradation Patterns in Service-based Systems [0.76146285961466]
We present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems.
DeLag simultaneously searches for multiple latency patterns while optimizing precision, recall, and dissimilarity.
arXiv Detail & Related papers (2021-10-21T13:59:32Z)
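The DeLag entry above frames diagnosis as a multi-objective search over precision, recall, and dissimilarity. The sketch below shows the Pareto-dominance comparison such a search typically rests on; the candidate encoding and objective values are invented for illustration.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the non-dominated candidates: the trade-off front a multi-objective
    search maintains instead of collapsing objectives into a single score."""
    return [c for c in candidates
            if not any(dominates(o["objectives"], c["objectives"])
                       for o in candidates if o is not c)]

candidates = [
    {"patterns": "P1", "objectives": (0.9, 0.6, 0.8)},  # (precision, recall, dissimilarity)
    {"patterns": "P2", "objectives": (0.7, 0.8, 0.9)},
    {"patterns": "P3", "objectives": (0.6, 0.5, 0.7)},  # dominated by P1
]
print([c["patterns"] for c in pareto_front(candidates)])  # -> ['P1', 'P2']
```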
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
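The GATI entry above attributes its latency savings to learned caches that exploit temporal locality in serving workloads. The sketch below reduces the "learned" component to an embedding-similarity test, which is an assumption for illustration rather than GATI's actual design; `embed` and `model` are hypothetical callables.

```python
import numpy as np

class LearnedCache:
    """Prediction cache gated by input similarity (illustrative design)."""

    def __init__(self, embed, model, threshold=0.95):
        self.embed = embed           # cheap encoder: input -> vector
        self.model = model           # expensive DNN we want to avoid running
        self.threshold = threshold   # cosine-similarity bar for a cache hit
        self.keys, self.values = [], []

    def predict(self, x):
        q = self.embed(x)
        q = q / np.linalg.norm(q)
        for k, v in zip(self.keys, self.values):
            if float(q @ k) >= self.threshold:  # near-duplicate input: reuse output
                return v
        y = self.model(x)                       # miss: run the full model
        self.keys.append(q)                     # (eviction policy omitted)
        self.values.append(y)
        return y
```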
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.