Scaling Laws for Speculative Decoding
- URL: http://arxiv.org/abs/2505.07858v1
- Date: Thu, 08 May 2025 11:10:15 GMT
- Title: Scaling Laws for Speculative Decoding
- Authors: Siyuan Yan, Mo Zhu, Guo-qing Jiang, Jianfei Wang, Jiaxing Chen, Wentai Zhang, Xiang Liao, Xiao Cui, Chen Zhang, Zhuoran Song, Ran Zhu,
- Abstract summary: This study investigates speculative decoding techniques through dense LLM architectures. We discover Log-linear Scaling Laws (Theorems 1.1, 1.2 and 1.3) governing draft model acceptance rate (or decoding speed) across three dimensions. Building on these laws, we develop Scylla, which coordinates multi-dimensional scaling for popular LLMs.
- Score: 7.618542964397237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures like OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding techniques through dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods leveraging parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing decoding efficiency remain under-explored compared to conventional backbone LLMs developed through Pretraining->SFT->RLHF training paradigms. In this work, we discover Log-linear Scaling Laws (Theorems 1.1, 1.2 and 1.3) governing draft model acceptance rate (or decoding speed) across three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we develop Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows Scylla achieves 1.5-2.2 higher acceptance rate than EAGLE2 and 0.3 higher than EAGLE3 at temperature T = 0, with peak performance gains on summarization and QA tasks (Figure 2). Industrial inference engine deployments demonstrate 2X decoding throughput improvements over EAGLE2 (Table 5), validating the transformative potential of systematic scaling for efficient LLM inference. Code will be released later.
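For readers unfamiliar with the mechanism these scaling laws describe, the following is a minimal sketch of a single draft-verification cycle of speculative decoding at temperature T = 0 (greedy verification). The `draft_model` and `target_model` callables are hypothetical toy stand-ins rather than the paper's Scylla drafter or any real inference engine; the sketch only illustrates how the draft model's acceptance rate, the quantity the Log-linear Scaling Laws govern, determines how many tokens are committed per expensive target-model pass.

```python
# Minimal sketch of one speculative-decoding draft-verification cycle at T = 0.
# `draft_model` and `target_model` are hypothetical stand-ins (any callable that
# maps a token-id prefix to the next greedy token id), not the paper's method.
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],
    target_model: Callable[[List[int]], int],
    draft_len: int = 4,
) -> List[int]:
    """Draft `draft_len` tokens cheaply, then verify them with the target model."""
    # 1) Draft phase: the small model proposes a short continuation.
    drafted, ctx = [], list(prefix)
    for _ in range(draft_len):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verification phase: the target model checks each drafted token.
    #    (In a real engine this is a single batched forward pass, which is
    #    where the speedup comes from.)
    committed, ctx = [], list(prefix)
    for tok in drafted:
        target_tok = target_model(ctx)
        if target_tok == tok:
            committed.append(tok)         # draft agreed with the target: keep it
            ctx.append(tok)
        else:
            committed.append(target_tok)  # first mismatch: take the target's token and stop
            break
    else:
        # Every draft was accepted; the target pass still yields one extra token for free.
        committed.append(target_model(ctx))
    return committed

if __name__ == "__main__":
    # Toy models: the target counts up by one; the draft is occasionally wrong.
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (1 if len(ctx) % 5 else 2)
    print(speculative_step([0], draft, target, draft_len=6))  # -> [1, 2, 3, 4, 5]
```

In this toy run four drafted tokens are accepted and the fifth is replaced by the target model's own token, so five tokens are committed for one conceptual target pass. Under the standard i.i.d. analysis of speculative decoding (Leviathan et al., 2023), with per-token acceptance probability alpha and draft length k, the expected number of committed tokens per target pass is (1 - alpha^(k+1)) / (1 - alpha); this is why acceptance rate is the natural quantity for the scaling laws to target.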
Related papers
- OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique [59.18475981916166]
We introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions). In this work, we employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance.
arXiv Detail & Related papers (2025-07-11T23:35:54Z)
- KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities [2.4874078867686085]
KeyKnowledgeRAG (K2RAG) is a novel framework designed to overcome limitations in RAG implementations. It integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers.
arXiv Detail & Related papers (2025-07-10T12:19:03Z)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models [12.98064202867316]
Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. We investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale.
arXiv Detail & Related papers (2025-06-28T22:13:43Z)
- Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective [18.791753740931185]
OpenAI o3-mini and DeepSeek-R1 use enhanced reasoning through Chain-of-Thought (CoT). This paper investigates whether reasoning LLMs can address challenges in High-Level Synthesis (HLS) design space exploration and optimization.
arXiv Detail & Related papers (2025-03-17T01:21:39Z)
- Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. However, LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. The new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z)
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly. Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items. We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73× to 1.96×, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs [44.03692512352445]
Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Model (LLM) quantization. We present CLAQ by introducing three different types of adaptive strategies for LLM quantization.
Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings.
arXiv Detail & Related papers (2024-05-27T14:49:39Z)
- Temporal Scaling Law for Large Language Models [57.83580734589091]
We propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position. We derive a much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law.
arXiv Detail & Related papers (2024-04-27T05:49:11Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.