Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
- URL: http://arxiv.org/abs/2408.00264v1
- Date: Thu, 1 Aug 2024 03:43:32 GMT
- Title: Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
- Authors: Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen
- Abstract summary: Regressive lightweight speculative decoding has garnered attention for its notable efficiency improvements in text generation tasks.
Clover-2 is an RNN-based draft model designed to achieve comparable accuracy to that of attention decoder layer models.
- Score: 8.046705062670096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) frequently suffer from inefficiencies, largely attributable to the discord between the requirements of auto-regressive decoding and the architecture of contemporary GPUs. Recently, regressive lightweight speculative decoding has garnered attention for its notable efficiency improvements in text generation tasks. This approach utilizes a lightweight regressive draft model, like a Recurrent Neural Network (RNN) or a single transformer decoder layer, leveraging sequential information to iteratively predict potential tokens. Specifically, RNN draft models are computationally economical but tend to deliver lower accuracy, while attention decoder layer models exhibit the opposite traits. This paper presents Clover-2, an advanced iteration of Clover, an RNN-based draft model designed to achieve comparable accuracy to that of attention decoder layer models while maintaining minimal computational overhead. Clover-2 enhances the model architecture and incorporates knowledge distillation to increase Clover's accuracy and improve overall efficiency. We conducted experiments using the open-source Vicuna 7B and LLaMA3-Instruct 8B models. The results demonstrate that Clover-2 surpasses existing methods across various model architectures, showcasing its efficacy and robustness.
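The draft-then-verify loop described in the abstract can be made concrete with a short sketch. This is a generic illustration of lightweight speculative decoding, not Clover-2's actual implementation: `draft_model` and `target_model` are assumed to be callables returning next-token logits, and greedy acceptance is used for simplicity.

```python
import torch

def speculative_decode(target_model, draft_model, prefix, k=4, max_new=64):
    """Minimal draft-then-verify loop (illustrative; not Clover-2's code).

    A lightweight draft model proposes k tokens sequentially; the target
    LLM then scores the whole proposal in one parallel forward pass and
    keeps the longest prefix whose greedy choices agree with the draft.
    """
    tokens = prefix.clone()
    target_len = prefix.shape[-1] + max_new
    while tokens.shape[-1] < target_len:
        # 1) Draft: cheap sequential proposals (e.g. an RNN draft head).
        draft = tokens
        for _ in range(k):
            next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2) Verify: a single parallel pass of the expensive target model.
        preds = target_model(draft).argmax(-1)  # greedy target choice per position

        # 3) Accept the longest run of drafted tokens the target agrees with,
        #    then append the target's own token at the first disagreement.
        n = tokens.shape[-1]
        accepted = 0
        while accepted < k and draft[0, n + accepted] == preds[0, n + accepted - 1]:
            accepted += 1
        tokens = torch.cat(
            [draft[:, : n + accepted], preds[:, n + accepted - 1 : n + accepted]],
            dim=-1,
        )
    return tokens
```

Clover-2's contribution is to make the draft step both cheap (RNN-style) and accurate (via architectural changes and knowledge distillation), so that more of the k proposals survive verification.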
Related papers
- Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation.
We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping.
We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
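As a rough illustration of layer skipping, a decoder layer with a residual connection degenerates to the identity when skipped. The minimal sketch below uses hypothetical class and argument names, not the paper's code:

```python
import torch.nn as nn

class SkippableDecoder(nn.Module):
    """Toy decoder stack whose layers can be skipped at inference time.

    Illustrative only: a skipped layer contributes nothing beyond the
    residual path, so the hidden state passes through unchanged.
    """
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x, skip=()):
        for i, layer in enumerate(self.layers):
            if i in skip:        # residual path only: h_{i+1} = h_i
                continue
            x = layer(x)         # full path: h_{i+1} = h_i + f_i(h_i)
        return x
```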
arXiv Detail & Related papers (2024-10-26T00:44:11Z)
- ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model [77.71796503321632]
We introduce a context model at the anchor level for 3DGS representation, yielding a size reduction of over 100 times compared to vanilla 3DGS and 15 times compared to the most recent state-of-the-art work, Scaffold-GS.
arXiv Detail & Related papers (2024-05-31T09:23:39Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
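One plausible way to realize such input-conditioned depth is a small gating head that picks how many encoder layers to run; the sketch below is an assumption-laden illustration (ECO-M2F's actual gating mechanism may differ):

```python
import torch.nn as nn

class DepthAdaptiveEncoder(nn.Module):
    """Sketch of input-conditioned encoder depth (hypothetical names).

    A tiny gating head looks at pooled features of the input and picks
    how many encoder layers to run, trading compute for accuracy per image.
    Assumes batch size 1 and token features of shape (1, seq, feat_dim).
    """
    def __init__(self, layers, feat_dim):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.gate = nn.Linear(feat_dim, len(layers))  # one score per candidate depth

    def forward(self, x):
        depth_logits = self.gate(x.mean(dim=1))       # pool tokens, choose a depth
        depth = int(depth_logits.argmax(-1).item()) + 1
        for layer in self.layers[:depth]:
            x = layer(x)
        return x
```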
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining [61.09807522366773]
We introduce an algorithm that approximates the softmax with provable bounds and dynamically maintains a tree-based index over the targets.
In our study on datasets with over twenty million targets, our approach cuts error in half relative to oracle brute-force negative mining.
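The tree details are beyond this summary, but the overall pattern, scoring candidate targets against the query with a periodically refreshed index and keeping the top scorers as negatives, can be sketched with a flat index. The paper's dynamically maintained tree replaces this brute-force scan and adds provable softmax bounds; all names here are illustrative:

```python
import numpy as np

def hard_negatives(query_emb, target_index, target_ids, pos_id, k=32):
    """Retrieve approximate hard negatives from a refreshed embedding index.

    Simplified stand-in for the paper's tree: a flat dot-product search
    that keeps the top-scoring targets which are not the positive.
    """
    scores = target_index @ query_emb          # (num_targets,)
    top = np.argsort(-scores)[: k + 1]         # k+1 in case pos_id is included
    return [target_ids[i] for i in top if target_ids[i] != pos_id][:k]
```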
arXiv Detail & Related papers (2023-03-27T15:18:32Z)
- TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders using Hierarchical Maps Distillation [16.04961815178485]
We propose a lightweight model that employs multiple simple heterogeneous decoders.
Our approach achieves saliency prediction accuracy on par or better than state-of-the-art methods.
arXiv Detail & Related papers (2023-01-11T18:20:19Z)
- 3DVerifier: Efficient Robustness Verification for 3D Point Cloud Models [17.487852393066458]
Existing verification methods for point cloud models are time-expensive and computationally unattainable on large networks.
We propose 3DVerifier to tackle both challenges by adopting a linear relaxation function to bound the multiplication layer and combining forward and backward propagation.
Our approach achieves an orders-of-magnitude improvement in verification efficiency for large networks, and the obtained certified bounds are also significantly tighter than those of state-of-the-art verifiers.
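To see why the multiplication layer needs a dedicated relaxation, consider the coarsest possible bound, plain interval arithmetic over a product. 3DVerifier's linear relaxation is tighter than this baseline, which is sketched here only for intuition:

```python
import numpy as np

def interval_mul_bounds(xl, xu, yl, yu):
    """Interval bounds for an elementwise product z = x * y.

    A coarse baseline for bounding a multiplication layer: with x in
    [xl, xu] and y in [yl, yu], z is bracketed by the extremes of the
    four corner products. Linear (McCormick-style) relaxations, as used
    by verifiers like 3DVerifier, yield tighter bounds than this.
    """
    cands = np.stack([xl * yl, xl * yu, xu * yl, xu * yu])
    return cands.min(axis=0), cands.max(axis=0)
```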
arXiv Detail & Related papers (2022-07-15T15:31:16Z)
- Load-balanced Gather-scatter Patterns for Sparse Deep Neural Networks [20.374784902476318]
Pruning, which introduces zeros into model weights, has been shown to be effective at providing good trade-offs between model accuracy and computation efficiency.
Some modern processors are equipped with fast on-chip scratchpad memories and gather/scatter engines that perform indirect load and store operations on such memories.
In this work, we propose a set of novel sparse patterns, named gather-scatter (GS) patterns, to utilize the scratchpad memories and gather/scatter engines to speed up neural network inferences.
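In software terms, a gather is an indirect load through an index array. The sketch below mimics a GS-style sparse matrix-vector product in NumPy; it is illustrative only, and the paper's patterns additionally balance the load across hardware lanes:

```python
import numpy as np

def gs_sparse_matvec(values, col_idx, row_ptr, x):
    """CSR-style matrix-vector product via explicit gathers.

    Software analogue of a gather/scatter engine: `x[col_idx[lo:hi]]` is
    the indirect load; hardware GS patterns arrange the indices so such
    loads are load-balanced across scratchpad-memory lanes.
    """
    y = np.zeros(len(row_ptr) - 1, dtype=x.dtype)
    for r in range(len(y)):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        y[r] = values[lo:hi] @ x[col_idx[lo:hi]]   # gather, then multiply-accumulate
    return y
```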
arXiv Detail & Related papers (2021-12-20T22:55:45Z)
- A novel attention-based network for fast salient object detection [14.246237737452105]
In current salient object detection networks, the most popular approach uses a U-shaped structure.
We propose a new deep convolutional network architecture with three contributions.
Results demonstrate that the proposed method can compress the model to nearly 1/3 of its original size with almost no loss in accuracy.
arXiv Detail & Related papers (2021-12-20T12:30:20Z)
- Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets, but they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z)
- Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference [119.19779637025444]
Deep networks have recently been suggested to face a trade-off between accuracy (on clean natural images) and robustness (on adversarially perturbed images).
This paper studies multi-exit networks with input-adaptive inference, showing their strong promise in achieving a "sweet point" in co-optimizing model accuracy, robustness, and efficiency.
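A minimal picture of input-adaptive, multi-exit inference: attach a prediction head to several intermediate blocks and stop as soon as one is confident enough. The names below (`blocks`, `exit_heads`, `threshold`) are illustrative, not the paper's API:

```python
import torch

def input_adaptive_forward(blocks, exit_heads, x, threshold=0.9):
    """Early-exit inference over a multi-exit network (illustrative names).

    Assumes batch size 1 and conv features of shape (1, C, H, W). Easy
    inputs clear the confidence threshold at a shallow exit and save
    compute; harder (e.g. adversarially perturbed) inputs fall through
    to deeper exits.
    """
    for block, head in zip(blocks, exit_heads):
        x = block(x)
        logits = head(x.mean(dim=(-2, -1)))        # global average pool, then head
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        if conf.item() >= threshold:
            return pred, conf                      # early exit
    return pred, conf                              # final (deepest) exit
```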
arXiv Detail & Related papers (2020-02-24T00:40:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.