COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency
with Slenderized Multi-exit Language Models
- URL: http://arxiv.org/abs/2210.15523v1
- Date: Thu, 27 Oct 2022 15:06:40 GMT
- Title: COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency
with Slenderized Multi-exit Language Models
- Authors: Bowen Shen, Zheng Lin, Yuanxin Liu, Zhengxiao Liu, Lei Wang, Weiping
Wang
- Abstract summary: Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity.
Existing statically compressed models are unaware of the varying complexity across input instances.
We propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration.
- Score: 16.586312156966635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based pre-trained language models (PLMs) mostly suffer from
excessive overhead despite their advanced capacity. For resource-constrained
devices, there is an urgent need for a spatially and temporally efficient model
which retains the major capacity of PLMs. However, existing statically
compressed models are unaware of the varying complexity across input
instances, potentially resulting in redundancy for simple inputs and
inadequacy for complex ones. Moreover, miniature models with early exiting
face a trade-off between making predictions at an exit and serving the deeper layers.
Motivated by such considerations, we propose a collaborative optimization for
PLMs that integrates static model compression and dynamic inference
acceleration. Specifically, the PLM is slenderized in width while the depth
remains intact, complementing layer-wise early exiting to speed up inference
dynamically. To address the early-exiting trade-off, we propose a joint
training approach that calibrates slenderization and preserves the structures
that contribute to each exit, rather than only to the final layer. Experiments
are conducted on the GLUE benchmark, and the results verify the Pareto
optimality of our approach at high compression and acceleration rates, with
1/8 the parameters and 1/19 the FLOPs of BERT.
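To make the collaboration concrete, here is a minimal sketch of layer-wise early exiting over a width-slenderized, full-depth encoder, assuming a simple max-probability confidence threshold; the widths, pooling, threshold, and module names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiExitEncoder(nn.Module):
    """Toy multi-exit encoder: narrow ("slenderized") layers at full depth,
    with a classifier head at every layer, not only after the last one."""

    def __init__(self, hidden=128, depth=12, num_labels=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                       dim_feedforward=2 * hidden,  # narrowed FFN
                                       batch_first=True)
            for _ in range(depth))
        self.exits = nn.ModuleList(nn.Linear(hidden, num_labels)
                                   for _ in range(depth))

    @torch.no_grad()
    def infer(self, x, threshold=0.9):
        """Exit at the first layer whose prediction is confident enough."""
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)
            probs = torch.softmax(exit_head(x.mean(dim=1)), dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:   # batch size 1 assumed
                return pred, conf
        return pred, conf                  # fell through to the final exit

model = MultiExitEncoder()
tokens = torch.randn(1, 16, 128)           # (batch, seq, hidden), pre-embedded
print(model.infer(tokens))
```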
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
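LESA's learned layer predictor is not reproduced here; as a naive stand-in that still illustrates depth scaling-up, the sketch below initializes each inserted layer as the average of its two pre-trained neighbors (the averaging is an assumption in place of LESA's learnable mapping).

```python
import copy
import torch
import torch.nn as nn

def scale_up_depth(layers: nn.ModuleList) -> nn.ModuleList:
    """Insert a new layer between each pair of pre-trained layers,
    initialized as the average of its neighbors (naive stand-in)."""
    expanded = []
    for lo, hi in zip(layers[:-1], layers[1:]):
        expanded.append(lo)
        mid = copy.deepcopy(lo)
        with torch.no_grad():
            for p_mid, p_lo, p_hi in zip(mid.parameters(),
                                         lo.parameters(), hi.parameters()):
                p_mid.copy_(0.5 * (p_lo + p_hi))
        expanded.append(mid)
    expanded.append(layers[-1])
    return nn.ModuleList(expanded)

small = nn.ModuleList(nn.Linear(8, 8) for _ in range(4))
large = scale_up_depth(small)
print(len(small), "->", len(large))   # 4 -> 7
```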
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
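A rough sketch of the low-rank-expert idea, assuming each expert is obtained by SVD-truncating the delta between a fine-tuned weight matrix and the base one; the routing mechanism is omitted and the rank is an arbitrary assumption.

```python
import torch

def low_rank_expert(w_base: torch.Tensor, w_ft: torch.Tensor, rank: int = 8):
    """Distill one fine-tuned model's contribution into a low-rank expert:
    SVD-truncate the weight delta (w_ft - w_base) to `rank` components."""
    delta = w_ft - w_base
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), columns scaled by singular values
    B = Vh[:rank, :]             # (rank, in); the expert acts as A @ B
    return A, B

w_base = torch.randn(64, 64)
w_ft = w_base + 0.01 * torch.randn(64, 64)   # pretend fine-tuning shift
A, B = low_rank_expert(w_base, w_ft)
print(A.shape, B.shape)   # torch.Size([64, 8]) torch.Size([8, 64])
```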
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
Adaptive Sparse Trainer (AST) is a novel and efficient retraining framework tailored for semi-structured sparse models.
AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively.
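For reference, a 2:4 semi-structured pattern keeps the 2 largest-magnitude weights in every contiguous group of 4 along the input dimension; the sketch below builds such a mask (AST's retraining loop itself is not reproduced).

```python
import torch

def mask_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Build a 2:4 sparsity mask: in every group of 4 weights, keep the
    2 entries with the largest magnitude."""
    out_dim, in_dim = weight.shape
    assert in_dim % 4 == 0
    groups = weight.abs().reshape(out_dim, in_dim // 4, 4)
    top2 = groups.topk(2, dim=-1).indices          # winners within each group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, top2, True)
    return mask.reshape(out_dim, in_dim)

w = torch.randn(4, 8)
m = mask_2_to_4(w)
print(m.int())     # exactly two ones per group of four
print(w * m)       # the pruned weight
```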
arXiv Detail & Related papers (2024-07-30T06:33:44Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
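A speculative sketch of the two-route idea: frozen early layers run without storing activations and their outputs are consolidated (a plain average stands in for the anti-redundancy operation, which is an assumption here), while gradients flow only through the few late layers.

```python
import torch
import torch.nn as nn

backbone = nn.ModuleList(nn.Linear(32, 32) for _ in range(12))
late_k = 2                              # only the last 2 layers stay trainable
for layer in backbone[:-late_k]:
    layer.requires_grad_(False)

def forward(x):
    feats = []
    with torch.no_grad():               # early route: no activation storage
        for layer in backbone[:-late_k]:
            x = torch.relu(layer(x))
            feats.append(x)
    # Stand-in "anti-redundancy" consolidation: average the early features.
    x = torch.stack(feats).mean(dim=0)
    for layer in backbone[-late_k:]:    # late route: gradients flow only here
        x = torch.relu(layer(x))
    return x

out = forward(torch.randn(4, 32))
out.sum().backward()                    # peak memory scales with late_k layers
```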
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Preparing Lessons for Progressive Training on Language Models [75.88952808979087]
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions.
We propose Apollo, which prepares lessons for expanding operations by layer functionality during training of low layers.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
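Apollo's prepared lessons are its own mechanism; as a generic illustration of progressive depth expansion, the sketch below warm-starts a deeper stack by repeating trained low layers, a common baseline scheme rather than Apollo's method.

```python
import copy
import torch.nn as nn

def expand_depth(shallow: nn.ModuleList, factor: int = 2) -> nn.ModuleList:
    """Warm-start a deeper model by repeating each trained low layer."""
    deep = []
    for layer in shallow:
        for _ in range(factor):
            deep.append(copy.deepcopy(layer))
    return nn.ModuleList(deep)

stage1 = nn.ModuleList(nn.Linear(16, 16) for _ in range(6))
# ... train stage1 for a while, then grow it and keep training ...
stage2 = expand_depth(stage1)
print(len(stage1), "->", len(stage2))   # 6 -> 12
```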
arXiv Detail & Related papers (2024-01-17T13:04:14Z)
- Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models [109.06052781040916]
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
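For context, a minimal example of a parameter-shared encoder in the ALBERT style, where one layer's weights are reused at every depth; running fewer iterations of the shared layer is one inference-efficiency lever such methods can exploit. The shapes and depth are illustrative.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """One transformer layer's weights reused at every depth, so the
    parameter count stays constant as effective depth grows."""

    def __init__(self, hidden=64, depth=12):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                                 batch_first=True)
        self.depth = depth

    def forward(self, x, depth=None):
        # Fewer iterations of the shared layer = cheaper inference.
        for _ in range(depth or self.depth):
            x = self.shared(x)
        return x

enc = SharedLayerEncoder()
x = torch.randn(2, 10, 64)
print(enc(x).shape, enc(x, depth=4).shape)
```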
arXiv Detail & Related papers (2023-10-19T15:13:58Z)
- Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging [24.64264715041198]
We introduce Sparse Model Soups (SMS), a novel method for merging sparse models by initiating each prune-retrain cycle with the averaged model from the previous phase.
SMS preserves sparsity, exploits sparse network benefits, is modular and fully parallelizable, and substantially improves IMP's performance.
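A minimal sketch of one soup step, assuming the models of a phase share per-tensor pruning masks (they branch from the same pruned model), so the average stays sparse; the dict-based format and names are illustrative.

```python
import torch

def sparse_soup(state_dicts, masks):
    """Average several retrained sparse models that share pruning masks,
    then reapply the masks so pruned entries stay exactly zero. The
    averaged model seeds the next prune-retrain cycle."""
    avg = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name] for sd in state_dicts])
        avg[name] = stacked.mean(dim=0)
        if name in masks:
            avg[name] = avg[name] * masks[name]
    return avg

sd1 = {"w": torch.tensor([0.0, 1.0, 2.0, 0.0])}
sd2 = {"w": torch.tensor([0.0, 3.0, 4.0, 0.0])}
mask = {"w": torch.tensor([0.0, 1.0, 1.0, 0.0])}
print(sparse_soup([sd1, sd2], mask))   # {'w': tensor([0., 2., 3., 0.])}
```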
arXiv Detail & Related papers (2023-06-29T08:49:41Z)
- Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show that distillation, which has been highly effective for task-agnostic compression of smaller BERT-style models, is inefficient compared with our simple pruning technique.
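The pruning half can be sketched in a few lines; the keep-every-other-layer pattern below is an assumption for illustration, and the extended pretraining that follows it is omitted.

```python
import torch.nn as nn

def chop_layers(layers: nn.ModuleList, keep_every: int = 2) -> nn.ModuleList:
    """Embarrassingly simple structured compression: drop whole layers
    (here, keep every 2nd), then continue pretraining what remains."""
    return nn.ModuleList(layer for i, layer in enumerate(layers)
                         if i % keep_every == 0)

full = nn.ModuleList(nn.Linear(8, 8) for _ in range(32))
pruned = chop_layers(full)
print(len(full), "->", len(pruned))   # 32 -> 16
```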
arXiv Detail & Related papers (2023-05-24T08:18:35Z)
- Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems [14.553972457854517]
We present a data-driven, space-time continuous framework to learn surrogate models for complex physical systems.
We leverage the expressive power of the network and a specially designed consistency-inducing regularization to obtain latent trajectories that are both low-dimensional and smooth.
arXiv Detail & Related papers (2023-01-25T03:06:03Z)
- Accelerating Pre-trained Language Models via Calibrated Cascade [37.00619245086208]
We analyze the working mechanism of dynamic early exiting and find it cannot achieve a satisfying trade-off between inference speed and performance.
We propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner.
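A minimal cascade sketch: answer with the small but complete model when it is confident, otherwise fall back to the large one. A fixed max-probability threshold stands in for the paper's calibrated difficulty measure, and the stand-in models are toy modules.

```python
import torch

@torch.no_grad()
def cascade_predict(small_model, large_model, x, threshold=0.8):
    """Run the small complete model first; escalate only when unsure."""
    probs = torch.softmax(small_model(x), dim=-1)
    conf, pred = probs.max(dim=-1)
    if conf.item() >= threshold:          # batch size 1 assumed
        return pred
    return large_model(x).argmax(dim=-1)  # fall back to the full model

small = torch.nn.Linear(16, 3)            # stand-ins for complete PLMs
large = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 3))
print(cascade_predict(small, large, torch.randn(1, 16)))
```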
arXiv Detail & Related papers (2020-12-29T09:43:50Z)
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
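A small sketch of the patience criterion: inference stops once a fixed number of consecutive internal classifiers agree on the prediction, with no confidence threshold involved. Head and state shapes here are illustrative assumptions.

```python
import torch

@torch.no_grad()
def patience_exit(hidden_states, exit_heads, patience=3):
    """Stop at the first layer where `patience` consecutive internal
    classifiers agree. `hidden_states` holds one pooled vector per layer."""
    prev, streak = None, 0
    for i, (h, head) in enumerate(zip(hidden_states, exit_heads)):
        pred = head(h).argmax(dim=-1).item()     # batch size 1 assumed
        streak = streak + 1 if pred == prev else 1
        prev = pred
        if streak >= patience:
            return pred, i + 1                   # exit layer (1-indexed)
    return prev, len(exit_heads)                 # used every layer

heads = [torch.nn.Linear(32, 2) for _ in range(12)]
states = [torch.randn(1, 32) for _ in range(12)]
print(patience_exit(states, heads))
```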
arXiv Detail & Related papers (2020-06-07T13:38:32Z)