Flatter Tokens are More Valuable for Speculative Draft Model Training
- URL: http://arxiv.org/abs/2601.18902v1
- Date: Mon, 26 Jan 2026 19:13:22 GMT
- Title: Flatter Tokens are More Valuable for Speculative Draft Model Training
- Authors: Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang,
- Abstract summary: Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. We propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach.
- Score: 8.13138934199466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2$\times$ training speedup using only 50% of the data, while keeping the final model's inference speedup within 4% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency of Speculative Decoding. Our code is available at https://anonymous.4open.science/r/Flatness.
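The abstract states the flatness idea only at a high level, so below is a minimal, hypothetical sketch of what a sample-level flatness score and an SFDD-style filtering pass could look like. Everything here is an assumption for illustration: the target model is taken to be a Hugging Face-style causal LM, mean token-level entropy stands in for the paper's (unspecified) flatness formula, and the 50% keep ratio merely echoes the data fraction reported in the experiments.

```python
# Hypothetical sketch, NOT the authors' SFDD implementation: token-level
# entropy of the target model's predictive distribution stands in for the
# paper's flatness metric, which the abstract does not define.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_flatness(target_model, input_ids: torch.Tensor) -> float:
    """Mean entropy of the target model's next-token distributions.
    Higher entropy = flatter distribution = (per the paper's claim)
    a more valuable sample for draft-model training."""
    logits = target_model(input_ids).logits            # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)   # (1, seq_len)
    return entropy.mean().item()

def distill_dataset(target_model, samples, keep_ratio: float = 0.5):
    """Keep the flattest `keep_ratio` fraction of samples; 0.5 mirrors
    the 50% subset reported in the abstract. `samples` holds 1-D
    token-id tensors."""
    scored = sorted(samples,
                    key=lambda s: sample_flatness(target_model, s.unsqueeze(0)),
                    reverse=True)
    return scored[: int(len(scored) * keep_ratio)]
```

Under this proxy, the filtered subset would then serve as the draft model's training set in an EAGLE-style pipeline, trading a one-time scoring pass over the data for the roughly 2x training speedup the abstract reports.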
Related papers
- Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice [109.9635246405237]
We show that experimental conclusions about data quality can flip with even minor adjustments to training hyperparameters. We introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy-model training. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation.
arXiv Detail & Related papers (2025-12-30T23:02:44Z)
- Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting [0.8665758002017515]
We show that, in some cases, "less is more" when considering datasets. We introduce a framework for discovering the optimal "training diet" from a large, unlabeled time series corpus.
arXiv Detail & Related papers (2025-12-12T13:26:07Z)
- CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation [71.52209438343928]
Core Distribution Alignment (CoDA) is a framework that enables effective Dataset Distillation (DD) using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics.
arXiv Detail & Related papers (2025-12-03T14:45:57Z)
- Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training [34.76379453286399]
We propose a series of alternative training paradigms that leverage insights from hard-data mining and dropout. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. Surprisingly, the proposed method improves accuracy by up to 4.82%.
arXiv Detail & Related papers (2025-05-28T13:26:52Z)
- CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z)
- Dataset Ownership Verification in Contrastive Pre-trained Models [37.03747798645621]
We propose the first dataset ownership verification method tailored specifically for self-supervised models pre-trained via contrastive learning. We validate the efficacy of this approach across multiple contrastive pre-trained models, including SimCLR, BYOL, SimSiam, MoCo v3, and DINO.
arXiv Detail & Related papers (2025-02-11T05:42:21Z)
- Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as the labeled data.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled samples are prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (see the sketch after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the enlarged dataset tractable, we propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
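As referenced in the ATC entry above, here is a minimal sketch of the Average Thresholded Confidence estimator. The use of max-softmax probability as the confidence score and the NumPy interface are assumptions not fixed by the summary; the core recipe (fit the threshold so the source-side fraction below it matches the source error rate, then report the fraction of target examples above it) follows the entry's description.

```python
# Minimal ATC sketch. Assumes 1-D NumPy arrays of precomputed confidence
# scores (e.g., max softmax probability) and 0/1 correctness labels.
import numpy as np

def fit_atc_threshold(source_conf: np.ndarray, source_correct: np.ndarray) -> float:
    """Choose t so that the fraction of source examples with confidence
    below t equals the model's error rate on the source data."""
    source_error = 1.0 - float(source_correct.mean())
    return float(np.quantile(source_conf, source_error))

def predict_target_accuracy(t: float, target_conf: np.ndarray) -> float:
    """ATC's estimate of target accuracy: the fraction of unlabeled
    target examples whose confidence exceeds the learned threshold."""
    return float((target_conf > t).mean())
```

A typical use would fit the threshold once on held-out labeled source data, then apply `predict_target_accuracy` to each unlabeled target distribution of interest.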