Collaborative Speculative Inference for Efficient LLM Inference Serving
- URL: http://arxiv.org/abs/2503.10325v2
- Date: Thu, 15 May 2025 12:02:56 GMT
- Title: Collaborative Speculative Inference for Efficient LLM Inference Serving
- Authors: Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang
- Abstract summary: CoSine is a novel speculative inference system that decouples sequential speculative decoding from parallel verification. With equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
- Score: 22.3288901160418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real-time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
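The draft-then-verify loop with confidence-based token fusion can be pictured with a short sketch. The code below is illustrative only: `fuse_drafts`, `speculative_step`, the peak-probability confidence weight, and the verifier interface are assumptions for exposition, not CoSine's actual implementation.

```python
import numpy as np

def fuse_drafts(per_drafter_logits):
    """Confidence-weighted token fusion: each drafter's next-token
    distribution is weighted by its own peak probability (a simple
    stand-in for a confidence measure), then renormalized."""
    fused = None
    for logits in per_drafter_logits:
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        weighted = probs.max() * probs       # confidence = peak probability
        fused = weighted if fused is None else fused + weighted
    return fused / fused.sum()

def speculative_step(drafters, verify_fn, prefix, k=4):
    """Draft k tokens with the fused drafter distribution, then hand the
    whole draft to the target model for one parallel verification pass."""
    draft = []
    for _ in range(k):
        fused = fuse_drafts([d(prefix + draft) for d in drafters])
        draft.append(int(fused.argmax()))
    n_accepted = verify_fn(prefix, draft)    # target LLM checks all k at once
    return prefix + draft[:n_accepted]

# Toy usage: two random-logit "drafters" and a verifier that accepts 2 tokens.
rng = np.random.default_rng(0)
drafters = [lambda toks: rng.normal(size=16) for _ in range(2)]
print(speculative_step(drafters, lambda p, d: 2, [1, 2, 3]))
```

In a pipelined deployment as described above, the sequential drafting loop and the parallel verification pass would run on different nodes, which is where the scheduling problem CoSine addresses arises.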
Related papers
- DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling [20.605487145370752]
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework.
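Bandit-based budget allocation can be sketched with a standard UCB rule; the function below is a generic illustration under assumed interfaces, not the paper's algorithm.

```python
import math, random

def ucb_allocate(num_arms, total_budget, reward_fn, c=1.4):
    """Spend one sample per round on the arm with the highest upper
    confidence bound; reward_fn(arm) returns a utility in [0, 1]."""
    counts = [0] * num_arms
    values = [0.0] * num_arms
    for t in range(1, total_budget + 1):
        ucb = [
            float("inf") if counts[a] == 0
            else values[a] + c * math.sqrt(math.log(t) / counts[a])
            for a in range(num_arms)
        ]
        arm = ucb.index(max(ucb))
        r = reward_fn(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # running mean update
    return counts  # samples spent per arm

# Toy usage: arm 2 yields the highest average reward, so it earns most samples.
print(ucb_allocate(3, 100, lambda a: random.random() * (a + 1) / 3))
```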
arXiv Detail & Related papers (2025-06-19T05:40:54Z)
- S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models [38.784951111677856]
Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. We propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting.
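Multi-head drafting can be illustrated minimally: several lightweight heads each predict the token at one future position from the same hidden state, so a k-token draft costs a single pass. The interfaces below are assumptions, not the S$^4$C architecture.

```python
import numpy as np

def multi_head_draft(hidden, head_weights):
    """Each head is a linear map from the hidden state to vocabulary logits;
    head i predicts the token at future position i."""
    return [int(np.argmax(hidden @ w)) for w in head_weights]

# Toy usage: 3 heads over a 4-dim hidden state and an 8-token vocabulary.
rng = np.random.default_rng(0)
print(multi_head_draft(rng.normal(size=4), [rng.normal(size=(4, 8)) for _ in range(3)]))
```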
arXiv Detail & Related papers (2025-06-17T03:38:19Z)
- The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate an LAIM inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
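A layer-wise partition under a device compute budget can be sketched as follows; the greedy cut and the cost model are assumptions for illustration, not the paper's optimization.

```python
def partition_layers(layers, device_budget_flops, flops):
    """Place leading layers on the device until its compute budget is
    exhausted, then offload the remaining layers to the server."""
    used = 0.0
    for cut, layer in enumerate(layers):
        if used + flops(layer) > device_budget_flops:
            return layers[:cut], layers[cut:]
        used += flops(layer)
    return layers, []

# Toy usage: equal-cost layers, budget for the first two.
print(partition_layers(["l0", "l1", "l2", "l3"], 2.0, lambda _: 1.0))
```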
arXiv Detail & Related papers (2025-05-14T08:18:55Z)
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources.
This paper formulates LLM inference optimization as a multi-stage online scheduling problem.
We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
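The memory-constrained admission decision at the heart of such scheduling can be pictured with a toy rule; this is a stand-in for intuition, not the paper's fluid-guided policy.

```python
from collections import deque

def admit_requests(queue, memory_free, kv_cost):
    """Admit waiting requests in arrival order while their KV-cache
    footprint still fits in free memory; the rest keep waiting."""
    running = []
    while queue and kv_cost(queue[0]) <= memory_free:
        req = queue.popleft()
        memory_free -= kv_cost(req)
        running.append(req)
    return running, memory_free

# Toy usage: 10 GB free, each request's cache costs its length in GB.
q = deque([3, 4, 5])
print(admit_requests(q, 10, lambda r: r))  # admits 3 and 4; 5 waits
```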
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
- TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding [76.23719557942917]
TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens.
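Batch-level draft selection can be sketched greedily: take the globally highest-confidence tokens while keeping each request's selection prefix-contiguous (a token is only verifiable if its predecessors are kept). This is an assumed simplification, not the TETRIS optimality result.

```python
import heapq

def select_draft_tokens(batch_scores, budget):
    """batch_scores[r] is the ordered confidence list of request r's draft
    tokens; repeatedly pick the best next-available token across requests."""
    pos = [0] * len(batch_scores)          # next candidate index per request
    heap = [(-s[0], r) for r, s in enumerate(batch_scores) if s]
    heapq.heapify(heap)
    kept = {r: 0 for r in range(len(batch_scores))}
    while heap and budget > 0:
        _, r = heapq.heappop(heap)
        kept[r] += 1
        budget -= 1
        pos[r] += 1
        if pos[r] < len(batch_scores[r]):
            heapq.heappush(heap, (-batch_scores[r][pos[r]], r))
    return kept  # number of draft tokens to verify per request

# Toy usage: request 0 is confident, request 1 is not.
print(select_draft_tokens([[0.9, 0.8, 0.7], [0.4, 0.2]], budget=4))  # {0: 3, 1: 1}
```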
arXiv Detail & Related papers (2025-02-21T04:19:24Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
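A threshold-based version of reward-guided acceptance can be sketched as follows; the interfaces and the fixed threshold `tau` are assumptions, not RSD's actual rule.

```python
def reward_guided_decode(drafter, target, reward_model, prompt, max_len=64, tau=0.5):
    """drafter/target map a token list to a next token; reward_model scores a
    candidate continuation in [0, 1]. A draft token is accepted when its
    reward clears tau, biasing decoding toward high-reward outputs instead
    of enforcing strict unbiasedness."""
    tokens = list(prompt)
    for _ in range(max_len):
        cand = drafter(tokens)
        if reward_model(tokens + [cand]) >= tau:
            tokens.append(cand)            # cheap path: drafter token accepted
        else:
            tokens.append(target(tokens))  # fallback to the target model
    return tokens

# Toy usage: drafter proposes 0s, target proposes 1s, reward favors 1s.
print(reward_guided_decode(lambda t: 0, lambda t: 1, lambda t: float(t[-1]), [1], max_len=4))
```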
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
- Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation [27.484259938667776]
Large Language Models excel at code generation yet struggle with complex programming tasks that demand reasoning. We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency.
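The executable-verification signal can be illustrated with a minimal pass/fail runner; the file-based setup and binary reward are assumptions, not the paper's exact pipeline.

```python
import subprocess, sys, tempfile, textwrap

def execution_reward(candidate_src, tests_src, timeout=5):
    """Write the candidate program plus its tests to a file, run it in a
    subprocess, and reward 1.0 on a clean exit, 0.0 otherwise."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(candidate_src) + "\n" + textwrap.dedent(tests_src))
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Toy usage: an assert that passes yields reward 1.0.
print(execution_reward("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))
```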
arXiv Detail & Related papers (2024-12-19T17:59:42Z)
- CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing [70.25689961697523]
We propose a generalizable algorithm that enhances sequential reasoning by cross-task experience sharing and selection.
Our work bridges the gap between existing sequential reasoning paradigms and validates the effectiveness of leveraging cross-task experiences.
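Cross-task experience selection can be pictured as retrieval over stored (embedding, trajectory) pairs; cosine similarity here is an assumed relevance measure, not necessarily the paper's provable selection rule.

```python
import numpy as np

def select_experiences(task_vec, memory, k=3):
    """Return the k stored trajectories whose task embeddings are closest
    (by cosine similarity) to the current task embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    ranked = sorted(memory, key=lambda m: cos(task_vec, m[0]), reverse=True)
    return [traj for _, traj in ranked[:k]]

# Toy usage: the memory entry most aligned with the task comes back first.
mem = [(np.array([1.0, 0.0]), "traj_a"), (np.array([0.0, 1.0]), "traj_b")]
print(select_experiences(np.array([0.9, 0.1]), mem, k=1))  # ['traj_a']
```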
arXiv Detail & Related papers (2024-10-22T03:59:53Z)
- Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates inference of the target large language model without altering its outputs.
We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding.
Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
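A partial-alignment objective can be sketched as feature matching on a sampled subset of positions; the random sampling and MSE form below are assumptions, not FSPAD's actual loss.

```python
import numpy as np

def partial_alignment_loss(draft_feats, target_feats, sample_ratio=0.5, seed=0):
    """Align the drafter's features with the target model's on only a
    random subset of positions: an MSE over the sampled rows."""
    rng = np.random.default_rng(seed)
    n = draft_feats.shape[0]
    idx = rng.choice(n, size=max(1, int(n * sample_ratio)), replace=False)
    diff = draft_feats[idx] - target_feats[idx]
    return float(np.mean(diff ** 2))

# Toy usage: identical features give zero loss.
f = np.ones((8, 4))
print(partial_alignment_loss(f, f))  # 0.0
```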
arXiv Detail & Related papers (2024-08-28T06:28:01Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, securing up to 85% of its performance while obtaining over a 30% increase in inference speed.
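Decomposing a dense FFN into routed experts can be sketched by slicing its hidden dimension; the splitting scheme and top-1 routing below are assumptions, not FactorLLM's factorization.

```python
import numpy as np

def factorize_ffn(w_in, w_out, num_experts):
    """Partition the hidden dimension of a trained ReLU FFN
    (w_in: [d, h], w_out: [h, d]) into num_experts slices, each
    becoming one expert's weights, with no retraining."""
    idx = np.array_split(np.arange(w_in.shape[1]), num_experts)
    return [(w_in[:, i], w_out[i, :]) for i in idx]

def moe_forward(x, experts, router_w, top_k=1):
    """Route the input to its top-k experts; only those slices run,
    which is where the inference speedup comes from."""
    scores = x @ router_w                         # one score per expert
    out = np.zeros_like(x)
    for e in np.argsort(scores)[-top_k:]:
        w_in, w_out = experts[e]
        out += np.maximum(x @ w_in, 0.0) @ w_out  # ReLU FFN slice
    return out

# Toy usage with random weights.
rng = np.random.default_rng(0)
d, h, E = 8, 32, 4
experts = factorize_ffn(rng.normal(size=(d, h)), rng.normal(size=(h, d)), E)
print(moe_forward(rng.normal(size=d), experts, rng.normal(size=(d, E))).shape)
```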
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
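The benefit of DAG-managed hypotheses is that drafts sharing tokens are verified once instead of per sequence. The position-token merge rule below is a deliberate simplification for illustration, not the paper's graph construction.

```python
from collections import defaultdict

def build_hypothesis_dag(hypotheses):
    """Nodes are (position, token) pairs, so hypotheses that emit the same
    token at the same position share one verification node."""
    nodes = set()
    edges = defaultdict(set)
    for hyp in hypotheses:
        prev = ("root", -1)
        for pos, tok in enumerate(hyp):
            node = (pos, tok)
            nodes.add(node)
            edges[prev].add(node)
            prev = node
    return nodes, edges

# Toy usage: three 3-token drafts collapse from 9 naive checks to 5 shared nodes.
hyps = [[5, 7, 9], [5, 7, 2], [5, 3, 9]]
nodes, _ = build_hypothesis_dag(hyps)
print(len(nodes))  # 5
```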
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables. Despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in evaluation protocols. This study investigates various LERSs' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
- An Auction-based Coordination Strategy for Task-Constrained Multi-Agent Stochastic Planning with Submodular Rewards [7.419725234099728]
Existing task coordination algorithms either ignore the underlying stochastic process or suffer from high computational intensity.
We propose a decentralized auction-based coordination strategy using a newly formulated score function.
For the implementation on large-scale applications, an approximate variant of the proposed method, namely Deep Auction, is also suggested.
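One round of a decentralized auction can be sketched as follows; `score` is a generic stand-in for the paper's newly formulated score function, and conflict resolution between winners is omitted.

```python
def auction_round(agents, tasks, score):
    """Each agent bids its score on every task; each task simply goes
    to its highest bidder."""
    assignment = {}
    for task in tasks:
        bids = {agent: score(agent, task) for agent in agents}
        assignment[task] = max(bids, key=bids.get)
    return assignment

# Toy usage: each agent prefers the task matching its own index.
print(auction_round([0, 1], ["t0", "t1"], lambda a, t: -abs(a - int(t[1]))))
```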
arXiv Detail & Related papers (2022-12-30T10:25:25Z)