Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation
- URL: http://arxiv.org/abs/2410.10141v1
- Date: Mon, 14 Oct 2024 04:17:45 GMT
- Title: Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation
- Authors: Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen
- Abstract summary: This paper delves into the effects of decoding temperatures on speculative decoding's efficacy.
We first highlight the challenge of decoding at higher temperatures, and demonstrate KD in a consistent temperature setting could be a remedy.
Building upon these findings, we take an initial step to further the speedup for speculative decoding, particularly in a high-temperature generation setting.
- Score: 76.5894260737116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding stands as a pivotal technique to expedite inference in autoregressive (large) language models. This method employs a smaller draft model to speculate a block of tokens, which the target model then evaluates for acceptance. Despite a wealth of studies aimed at increasing the efficiency of speculative decoding, the influence of generation configurations on the decoding process remains poorly understood, especially concerning decoding temperatures. This paper delves into the effects of decoding temperatures on speculative decoding's efficacy. Beginning with knowledge distillation (KD), we first highlight the challenge of decoding at higher temperatures, and demonstrate KD in a consistent temperature setting could be a remedy. We also investigate the effects of out-of-domain testing sets with out-of-range temperatures. Building upon these findings, we take an initial step to further the speedup for speculative decoding, particularly in a high-temperature generation setting. Our work offers new insights into how generation configurations drastically affect the performance of speculative decoding, and underscores the need for developing methods that focus on diverse decoding configurations. Code is publicly available at https://github.com/ozyyshr/TempSpec.
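The abstract describes a draft model proposing tokens that the target model accepts or rejects, with both distributions shaped by the decoding temperature. The sketch below is a minimal illustration of that interaction, not the paper's code. It assumes the standard speculative-sampling rule, in which a drafted token x is accepted with probability min(1, p(x)/q(x)), so the expected per-token acceptance rate is the sum over x of min(p(x), q(x)). The function names and the toy logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_acceptance_rate(target_probs, draft_probs):
    """Chance that one drafted token is accepted.

    Assumes the standard rule: sample x from the draft distribution q and
    accept it with probability min(1, p(x) / q(x)). Averaging over q gives
    sum_x min(p(x), q(x)).
    """
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))


# Made-up logits over a four-token vocabulary (illustration only).
target_logits = [2.0, 1.0, 0.5, -1.0]
draft_logits = [1.5, 1.2, 0.0, -0.5]

for temperature in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(target_logits, temperature)
    q = softmax_with_temperature(draft_logits, temperature)
    rate = expected_acceptance_rate(p, q)
    print(f"T={temperature}: expected acceptance = {rate:.3f}")
```

Because the logits are invented, the printed values show only how the quantity is computed; they say nothing about how real draft and target models behave at different temperatures.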
Related papers
- Adaptive Decoding via Latent Preference Optimization [55.70602730588745]
We introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time.
Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures.
arXiv Detail & Related papers (2024-11-14T18:31:39Z)
- Instance Temperature Knowledge Distillation [15.095465128404161]
Existing methods dynamically adjust the temperature to enable the student network to adapt to varying learning difficulties.
We formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning.
Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily.
arXiv Detail & Related papers (2024-06-27T14:00:05Z)
- Efficient Sample-Specific Encoder Perturbations [37.84914870036184]
We show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model.
Results show consistent improvements in performance evaluated through COMET and WER respectively.
arXiv Detail & Related papers (2024-05-01T08:55:16Z)
- Dynamic Temperature Knowledge Distillation [9.6046915661065]
Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD).
Traditional approaches often employ a static temperature throughout the KD process.
We propose Dynamic Temperature Knowledge Distillation (DTKD) which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously.
arXiv Detail & Related papers (2024-04-19T08:40:52Z)
- Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
- Testing the Accuracy of Surface Code Decoders [55.616364225463066]
Large-scale, fault-tolerant quantum computations will be enabled by quantum error-correcting codes (QECC).
This work presents the first systematic technique to test the accuracy and effectiveness of different QECC decoding schemes.
arXiv Detail & Related papers (2023-11-21T10:22:08Z)
- Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models [54.72004797421481]
We conduct the first systematic study to explore a decoding strategy specialized in code generation.
Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling.
Results show that AdapT sampling significantly outperforms the state-of-the-art decoding strategy.
arXiv Detail & Related papers (2023-09-06T06:27:33Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which, in contrast, optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PTM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)