Intra-request branch orchestration for efficient LLM reasoning
- URL: http://arxiv.org/abs/2509.24957v1
- Date: Mon, 29 Sep 2025 15:52:08 GMT
- Title: Intra-request branch orchestration for efficient LLM reasoning
- Authors: Weifan Jiang, Rana Shahout, Yilun Du, Michael Mitzenmacher, Minlan Yu
- Abstract summary: Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
- Score: 52.68946975865865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms such as chain-of-thought and multi-branch reasoning to improve accuracy on complex tasks. These methods, however, substantially increase token usage and per-request latency. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions. DUCHESS employs a lightweight linear probing model over LLM layer activations to estimate branch correctness, and its orchestration policy decides whether to terminate, duplicate, or continue a branch. When handling multiple requests, DUCHESS further reduces latency by prioritizing easier reasoning tasks when complexity can be estimated from the prompt. Experiments on three reasoning benchmarks show that DUCHESS consistently improves the token-accuracy Pareto frontier, reducing token usage by 42-63% at matched accuracy compared to self-consistency. In serving with vLLM, DUCHESS reduces mean, median, and tail latencies by 57-81%, 58-85%, and 52-84% with First-Come-First-Served scheduling, and achieves additional gains under difficulty-aware scheduling at higher request rates.
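The abstract describes two mechanisms: a lightweight linear probe over LLM layer activations that estimates branch correctness, and an orchestration policy that maps that estimate to a terminate, duplicate, or continue decision per branch. The Python sketch below illustrates that idea only; it is not the authors' implementation, and the class names, thresholds, and pooled-activation input are illustrative assumptions.

```python
# Minimal sketch of the two ideas named in the abstract: a linear probe over a
# hidden-layer activation that scores branch correctness, and a threshold policy
# that maps the score to a per-branch action. All thresholds are hypothetical.
import torch


class BranchCorrectnessProbe(torch.nn.Module):
    """Lightweight linear probe: pooled layer activation -> P(branch is correct)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_dim, 1)

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # activation: (hidden_dim,) pooled activation from a chosen decoder layer
        return torch.sigmoid(self.linear(activation))


HIGH_CONF = 0.9   # hypothetical: confident enough to commit this branch's answer
LOW_CONF = 0.2    # hypothetical: unlikely to be correct, prune to save tokens
DUP_CONF = 0.6    # hypothetical: promising enough that duplicating may pay off


def orchestrate(prob_correct: float) -> str:
    """Map a probe score to one of the three actions named in the abstract."""
    if prob_correct >= HIGH_CONF:
        return "terminate"   # stop early and accept this branch's answer
    if prob_correct <= LOW_CONF:
        return "terminate"   # stop early and discard a likely-wrong branch
    if prob_correct >= DUP_CONF:
        return "duplicate"   # spawn another copy of a promising branch
    return "continue"        # keep decoding and re-check at the next interval


if __name__ == "__main__":
    probe = BranchCorrectnessProbe(hidden_dim=4096)
    activation = torch.randn(4096)  # stand-in for a real layer activation
    print(orchestrate(probe(activation).item()))
```

In a serving stack such as vLLM, a probe like this would be evaluated per branch at periodic decoding intervals and the resulting action fed back to the scheduler; the thresholds above are placeholders, not values reported in the paper.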
Related papers
- Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models [6.002670452103349]
Large language models (LLMs) achieve state-of-the-art accuracy on complex reasoning tasks. But using a fixed token budget per query leads to over-computation on easy inputs and under-computation on hard ones. We introduce Predictive Scheduling, a plug-and-play framework that pre-runs lightweight predictors to estimate each query's optimal reasoning length or difficulty before any full generation.
arXiv Detail & Related papers (2026-02-01T13:58:23Z) - Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design [35.95362310928356]
LLM-based search agents achieve strong performance but suffer from severe latency. We revisit this bottleneck through the lens of speculation. We present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency.
arXiv Detail & Related papers (2025-11-25T08:15:17Z) - TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z) - Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling [55.026048429595384]
Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. We propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency.
arXiv Detail & Related papers (2025-11-12T13:57:43Z) - DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning [134.03095505580276]
Doing Length pEnalty Right (DLER) is a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines.
arXiv Detail & Related papers (2025-10-16T20:05:57Z) - FairBatching: Fairness-Aware Batch Formation for LLM Inference [2.0917668141703207]
This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT). We propose FairBatching, a novel system that enforces fair resource allocation between prefill and decode tasks.
arXiv Detail & Related papers (2025-10-16T07:43:56Z) - Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework [10.148124073650349]
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs). Longer outputs increase latency, memory usage, and KV-cache demands. We propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy.
arXiv Detail & Related papers (2025-09-17T15:33:44Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models. CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency [4.372762934308627]
We propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. Experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
arXiv Detail & Related papers (2025-05-20T04:12:37Z) - Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately [29.018731931275138]
Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning to respond to a given request. However, when incorporating the two scaling dimensions, system efficiency is dampened significantly for two reasons. We present SART, a serving framework for efficient and accurate LLM reasoning.
arXiv Detail & Related papers (2025-05-19T16:34:56Z) - Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z) - SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning [14.020244011380063]
SpecReason is a system that accelerates LRM inference. It exploits the semantic flexibility of thinking tokens in preserving final-answer accuracy. It achieves a 1.4-3.0x speedup over vanilla LRM inference.
arXiv Detail & Related papers (2025-04-10T16:05:19Z) - Efficiently Scaling LLM Reasoning with Certaindex [25.549811985276488]
Test-time reasoning algorithms can wastefully generate many tokens without improving accuracy. We introduce Certaindex, an algorithm-agnostic metric measuring when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and enables dynamic token allocation.
arXiv Detail & Related papers (2024-12-30T14:57:53Z) - SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows. The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. We propose SampleAttention, an adaptive structured and near-lossless sparse attention.
arXiv Detail & Related papers (2024-06-17T11:05:15Z)