Related papers: LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

URL: http://arxiv.org/abs/2602.23881v1
Date: Fri, 27 Feb 2026 10:20:11 GMT
Title: LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding
Authors: Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev,
Abstract summary: Speculative decoding accelerates autoregressive large language model (LLM) inference. Standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. We propose LK losses, special training objectives that directly target acceptance rate.
Score: 67.61563011564388
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.
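The abstract's claim that a lower KL divergence need not mean a higher acceptance rate can be checked on a toy example. The sketch below is not the paper's LK loss. It assumes the standard speculative sampling rule, under which the per-token acceptance probability is the sum over tokens of min(p, q), and it assumes the forward direction KL(p || q) for the proxy objective. The three-token distributions and the names `q_spread` and `q_peaked` are made up for illustration.

```python
import math


def forward_kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def acceptance_rate(p, q):
    """Per-token acceptance probability under standard speculative
    sampling (assumed here): sum_x min(p(x), q(x)) = 1 - TV(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical target distribution over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]

# Two hypothetical draft distributions with limited "capacity":
# q_spread covers every token but mis-weights all of them;
# q_peaked matches the head well and nearly drops the tail token.
q_spread = [0.5, 0.3, 0.2]
q_peaked = [0.78, 0.219, 0.001]

for name, q in [("q_spread", q_spread), ("q_peaked", q_peaked)]:
    print(f"{name}: KL={forward_kl(p, q):.3f}, "
          f"acceptance={acceptance_rate(p, q):.3f}")
```

In this toy case `q_spread` has the lower KL divergence while `q_peaked` has the higher acceptance rate, because forward KL heavily penalizes the near-zero tail probability even though that token contributes little to acceptance.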

Related papers

Learnable Chernoff Baselines for Inference-Time Alignment [64.81256817158851]
We introduce Learnable Chernoff Baselines (LCB) as a method for efficiently and approximately sampling from exponentially tilted kernels. We establish total-variation guarantees to the ideal aligned model, and demonstrate in both continuous and discrete diffusion settings that LCB sampling closely matches ideal rejection sampling.
arXiv Detail & Related papers (2026-02-08T00:09:40Z)
Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models [130.8912476550625]
We propose a positive-unlabeled (PU) reinforcement learning distillation method for on-premise small-model deployment. Our method distills the teacher's preference-optimization capability from black-box generations into a locally trainable student. Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
arXiv Detail & Related papers (2026-01-28T15:14:50Z)
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens. But due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z)
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match [21.810129153556044]
Training-Free Loosely Speculative Decoding (FLy) is a novel method that loosens the rigid verification criterion. We show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup.
arXiv Detail & Related papers (2025-11-28T08:23:30Z)
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models [90.45197506653341]
Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers. Aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
arXiv Detail & Related papers (2025-10-06T17:58:01Z)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment [25.988070517700848]
Speculative decoding has been proposed as a technique to accelerate autoregressive generation. We show that even powerful draft models such as GPT-4o, as well as human text, cannot achieve high acceptance rates. We ask the following question: Can we adapt verification to recognize correct, but non-aligned replies?
arXiv Detail & Related papers (2025-01-31T17:09:53Z)
Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration [14.011702040133848]
We propose a CTC-based draft model which strengthens the correlations between draft tokens during the draft phase. Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed.
arXiv Detail & Related papers (2024-11-25T14:10:21Z)
On Divergence Measures for Training GFlowNets [3.7277730514654555]
Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution. We review four divergence measures, namely, Rényi-$\alpha$, Tsallis-$\alpha$, reverse KL, and forward KL, and design statistically efficient estimators for their gradients in the context of training GFlowNets.
arXiv Detail & Related papers (2024-10-12T03:46:52Z)
Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory [1.5771347525430772]
A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server. Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution. Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total bitrate and gradient distortion.
arXiv Detail & Related papers (2022-01-07T20:17:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.