Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
- URL: http://arxiv.org/abs/2410.21035v1
- Date: Mon, 28 Oct 2024 13:56:30 GMT
- Title: Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
- Authors: Justin Deschenaux, Caglar Gulcehre,
- Abstract summary: diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark.
Even without caching, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV caching.
- Score: 1.5858234832499587
- License:
- Abstract: Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advances have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.
Related papers
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.
Our framework offers a 1.3$times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z) - ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling.
We show that we can convert AR models ranging from 127M to 7B parameters into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training.
Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding.
LANTERN increases speed-ups by $mathbf1.75times$ and $mathbf1.76times$, as compared to greedy decoding and random sampling.
arXiv Detail & Related papers (2024-10-04T12:21:03Z) - EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
arXiv Detail & Related papers (2024-10-03T05:43:24Z) - Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures [12.703947839247693]
Diffusion models, emerging as powerful deep generative tools, excel in various applications.
However, their remarkable generative performance is hindered by slow training and sampling.
This is due to the necessity of tracking extensive forward and reverse diffusion trajectories.
We present a multi-stage framework inspired by our empirical findings to tackle these challenges.
arXiv Detail & Related papers (2023-12-14T17:48:09Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.