Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
- URL: http://arxiv.org/abs/2406.17404v2
- Date: Sat, 05 Oct 2024 16:20:13 GMT
- Title: Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
- Authors: Yixuan Wang, Xianzhen Luo, Fuxuan Wei, Yijun Liu, Qingfu Zhu, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che,
- Abstract summary: We propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model.
The training method simply introduces some noise at the input for the model to learn the denoising task.
Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x times without compromising model performance.
- Score: 54.581599828392854
- License:
- Abstract: Existing speculative decoding methods typically require additional model structure and training processes to assist the model for draft token generation. This makes the migration of acceleration methods to the new model more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x times without compromising model performance. The MSN model also achieves comparable acceleration ratios to the SOTA model with additional model structure on Spec-Bench.
Related papers
- Scalable Ensembling For Mitigating Reward Overoptimisation [24.58937616758007]
Reinforcement Learning from Human Feedback has enabled significant advancements within language modeling for powerful, instruction-following models.
The alignment of these models remains a pressing challenge as the policy tends to overfit the learned proxy" reward model past an inflection point of utility.
arXiv Detail & Related papers (2024-06-03T05:46:53Z) - Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and
Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - Federated Topic Model and Model Pruning Based on Variational Autoencoder [14.737942599204064]
Federated topic modeling allows multiple parties to jointly train models while protecting data privacy.
This paper proposes a method to establish a federated topic model while ensuring the privacy of each node, and use neural network model pruning to accelerate the model.
Experimental results show that the federated topic model pruning can greatly accelerate the model training speed while ensuring the model's performance.
arXiv Detail & Related papers (2023-11-01T06:00:14Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Language Models not just for Pre-training: Fast Online Neural Noisy
Channel Modeling [35.43382144290393]
We introduce efficient approximations to make inference with the noisy channel approach as fast as strong ensembles.
We also show that the noisy channel approach can outperform strong pre-training results by achieving a new state of the art on WMT Romanian-English translation.
arXiv Detail & Related papers (2020-11-13T23:22:28Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.