Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design
- URL: http://arxiv.org/abs/2511.20048v1
- Date: Tue, 25 Nov 2025 08:15:17 GMT
- Title: Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design
- Authors: Zixiao Huang, Wen Zeng, Tianyu Fu, Tengxuan Liu, Yizhou Sun, Ke Hong, Xinhao Yang, Chengchun Liu, Yan Li, Quanlu Zhang, Guohao Dai, Zhenhua Zhu, Yu Wang
- Abstract summary: LLM-based search agents achieve strong performance but suffer from severe latency. We revisit this bottleneck through the lens of speculation. We present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency.
- Score: 35.95362310928356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by tool execution. We revisit this bottleneck through the lens of speculation. While the traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can often be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining the same or even achieving higher accuracy, enabling practical deployment of multi-step search agents.
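The abstract's two-phase idea can be sketched in a few lines: a cheap draft predictor proposes the next tool action, and full LLM reasoning (with verification) is skipped when the prediction is confident and the step is an early evidence-gathering one. This is a minimal toy sketch, not SPAgent's actual implementation; the function names, confidence threshold, and step cutoff are all illustrative assumptions.

```python
# Toy sketch of adaptive speculation for a search agent. All names, the
# confidence threshold, and the early-step cutoff are invented for illustration.

def draft_action(query: str) -> tuple[str, float]:
    """Cheap draft predictor: a speculative tool action plus a confidence score."""
    # Stand-in heuristic: early evidence-gathering maps directly to a search call.
    return f"search({query!r})", 0.9

def full_reasoning_action(query: str) -> str:
    """Expensive path: full LLM reasoning before acting (simulated here)."""
    return f"search({query!r})"

def next_action(query: str, step: int,
                confidence_threshold: float = 0.8) -> tuple[str, bool]:
    """Phase 1: act on the speculation without verification when the draft is
    confident and the step is early. Phase 2: fall back to full reasoning."""
    action, conf = draft_action(query)
    if step < 3 and conf >= confidence_threshold:
        return action, True        # speculative; verification skipped
    return full_reasoning_action(query), False

action, speculative = next_action("who wrote Dune", step=0)
```

Later steps (or low-confidence drafts) take the full-reasoning path, so the speculation only removes work where the abstract argues it is safe: simple early evidence-gathering.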
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for STVG in an open-world, training-free scenario. Specifically, two specialized agents, a Spatial Reasoning Agent (SRA) and a Temporal Reasoning Agent (TRA), are constructed leveraging modern Multimodal Large Language Models (MLLMs). Experiments on popular benchmarks demonstrate the superiority of the proposed approach, which outperforms existing weakly-supervised and zero-shot approaches by a clear margin.
arXiv Detail & Related papers (2026-02-10T10:16:27Z) - DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties. We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow. We find that DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z) - DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents [31.08047797205678]
Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm. Despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental challenge: the serial execution of multi-round reasoning, tool calling, and tool-response waiting under the ReAct agent paradigm. In this paper, we propose an optimization framework for dLLM-based Search Agents.
arXiv Detail & Related papers (2026-02-03T09:12:08Z) - Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction [4.323124094061299]
Real-time sequential control agents are often bottlenecked by inference latency. We propose a framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. We show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction.
arXiv Detail & Related papers (2025-12-19T05:34:52Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Speculative Actions: A Lossless Framework for Faster Agentic Systems [6.708126506152481]
Execution of AI agents is often slow, hampering training, evaluation, and deployment. Inspired by speculative execution in microprocessors, we propose a framework that predicts likely actions using faster models. We evaluate this framework across three agentic environments (gaming, e-commerce, and web search) and a "lossy" extension for an operating-systems environment.
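The predict-then-verify loop described above can be illustrated with a lossless sketch: a fast model guesses the next tool call and its execution starts early, and the slow model's decision later confirms or discards the speculative result. The functions below are toy stand-ins I introduce for illustration, not the paper's API.

```python
# Minimal lossless speculative-actions loop. All functions are toy stand-ins.

def fast_model(state: str) -> str:
    return "lookup:" + state            # cheap predictor of the next action

def slow_model(state: str) -> str:
    return "lookup:" + state            # authoritative (slow) model

def run_tool(action: str) -> str:
    return f"result({action})"          # tool execution

def step(state: str) -> str:
    guess = fast_model(state)           # predict the action immediately
    speculative_result = run_tool(guess)  # start the tool early (conceptually in parallel)
    actual = slow_model(state)          # slow model finishes its reasoning
    if actual == guess:                 # lossless: accept only on exact match
        return speculative_result       # tool latency was hidden behind reasoning
    return run_tool(actual)             # mismatch: redo with the real action
```

Because a mismatch simply re-runs the tool with the authoritative action, the output is identical to the non-speculative agent; only latency changes.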
arXiv Detail & Related papers (2025-10-05T21:28:11Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - SPECS: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
SPECS is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that SPECS matches or surpasses beam search accuracy while reducing latency by up to ~19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z) - Speeding up Speculative Decoding via Sequential Approximate Verification [7.754712828900729]
Speculative Decoding (SD) is a recently proposed technique for faster inference with Large Language Models (LLMs). We propose SPRINTER, which utilizes a low-complexity verifier trained to predict whether tokens generated by a draft LLM would be accepted by the target LLM. By performing sequential approximate verification, SPRINTER does not require verification by the target LLM, which is invoked only when a token is deemed unacceptable.
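The approximate-verification idea above can be sketched as follows: a tiny learned verifier scores each draft token, and the expensive target model is consulted only on rejection. The draft generator, verifier scores, and target stub below are illustrative assumptions, not SPRINTER's actual components.

```python
# Rough sketch of sequential approximate verification. The draft tokens,
# verifier scores, and target response are hard-coded stand-ins.

def draft_tokens(prompt: str, n: int = 5) -> list[str]:
    return [f"t{i}" for i in range(n)]            # draft LLM proposals

def verifier_score(token: str) -> float:
    return 0.2 if token == "t3" else 0.95         # learned acceptance probability

def target_token(prompt: str, accepted: list[str]) -> str:
    return "T"                                    # target LLM correction

def approx_verify_decode(prompt: str, threshold: float = 0.5) -> list[str]:
    out = []
    for tok in draft_tokens(prompt):
        if verifier_score(tok) >= threshold:
            out.append(tok)                       # accepted; no target LLM call
        else:
            out.append(target_token(prompt, out)) # target invoked only here
    return out
```

In this toy run only one of five tokens triggers the target model, which is the latency win the abstract describes: most verification is handled by the cheap verifier.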
arXiv Detail & Related papers (2025-02-06T23:10:53Z) - Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly. Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items. We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z) - TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput [37.56866491624234]
Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. We present TurboSpec, a speculation control system that automatically profiles the execution environment. We demonstrate its effectiveness across diverse workloads and hardware configurations.
arXiv Detail & Related papers (2024-06-20T07:43:33Z)
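Both TurboSpec's closed-loop control and SPAgent's two-level scheduler rest on the same observation: speculative requests add engine load, so speculation should be throttled when the serving engine is busy. A minimal load-aware gate might look like the following; the utilization threshold and batch-size signal are invented for illustration, not taken from either system.

```python
# Toy load-aware speculation gate. The threshold and the use of batch size as
# the load signal are illustrative assumptions.

def should_speculate(batch_size: int, max_batch: int,
                     util_threshold: float = 0.7) -> bool:
    """Allow speculative requests only while engine utilization is below the
    threshold; under heavy load, speculation would displace real work."""
    return batch_size / max_batch < util_threshold

assert should_speculate(10, 32) is True    # light load: speculation helps
assert should_speculate(30, 32) is False   # heavy load: speculation throttled
```

A real controller would also feed back acceptance rates and goodput rather than gating on a single static threshold, but the gate captures the core control decision.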
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.