Related papers: Adapting by Pruning: A Case Study on BERT

Adapting by Pruning: A Case Study on BERT

URL: http://arxiv.org/abs/2105.03343v1
Date: Fri, 7 May 2021 15:51:08 GMT
Title: Adapting by Pruning: A Case Study on BERT
Authors: Yang Gao and Nicolo Colombo and Wei Wang
Abstract summary: We propose a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task. We formulate adapting-by-pruning as an optimisation problem with a differentiable loss and propose an efficient algorithm to prune the model. Results suggest that our method can prune up to 50% weights in BERT while yielding similar performance compared to the fine-tuned full model.
Score: 9.963251767416967
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adapting pre-trained neural models to downstream tasks has become the standard practice for obtaining high-quality models. In this work, we propose a novel model adaptation paradigm, adapting by pruning, which prunes neural connections in the pre-trained model to optimise the performance on the target task; all remaining connections have their weights intact. We formulate adapting-by-pruning as an optimisation problem with a differentiable loss and propose an efficient algorithm to prune the model. We prove that the algorithm is near-optimal under standard assumptions and apply the algorithm to adapt BERT to some GLUE tasks. Results suggest that our method can prune up to 50% weights in BERT while yielding similar performance compared to the fine-tuned full model. We also compare our method with other state-of-the-art pruning methods and study the topological differences of their obtained sub-networks.

Related papers

Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios.<n>We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Large gradient algorithms like Adam, Adam, and their variants have been central to the development of this type of training. We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval [0.7646713951724011]
Existing approaches either fine-tune the pre-trained model itself or, more efficiently, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval.
arXiv Detail & Related papers (2024-09-04T00:10:36Z)
Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT) We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
Adaptive Sparse Gaussian Process [0.0]
We propose the first adaptive sparse Gaussian Process (GP) able to address all these issues. We first reformulate a variational sparse GP algorithm to make it adaptive through a forgetting factor. We then propose updating a single inducing point of the sparse GP model together with the remaining model parameters every time a new sample arrives.
arXiv Detail & Related papers (2023-02-20T21:34:36Z)
Faster Adaptive Federated Learning [84.38913517122619]
Federated learning has attracted increasing attention with the emergence of distributed data. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on momentum-based variance reduced technique in cross-silo FL.
arXiv Detail & Related papers (2022-12-02T05:07:50Z)
Robust Binary Models by Pruning Randomly-initialized Networks [57.03100916030444]
We propose ways to obtain robust models against adversarial attacks from randomly-d binary networks. We learn the structure of the robust model by pruning a randomly-d binary network. Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T00:05:08Z)
Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models. Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
Adaptive Sampling for Minimax Fair Classification [40.936345085421955]
We propose an adaptive sampling algorithm based on the principle of optimism, and derive theoretical bounds on its performance. By deriving algorithm independent lower-bounds for a specific class of problems, we show that the performance achieved by our adaptive scheme cannot be improved in general.
arXiv Detail & Related papers (2021-03-01T04:58:27Z)
Evolutionary Variational Optimization of Generative Models [0.0]
We combine two popular optimization approaches to derive learning algorithms for generative models: variational optimization and evolutionary algorithms. We show that evolutionary algorithms can effectively and efficiently optimize the variational bound. In the category of "zero-shot" learning, we observed the evolutionary variational algorithm to significantly improve the state-of-the-art in many benchmark settings.
arXiv Detail & Related papers (2020-12-22T19:06:33Z)
Neural Model-based Optimization with Right-Censored Observations [42.530925002607376]
Neural networks (NNs) have been demonstrated to work well at the core of model-based optimization procedures. We show that our trained regression models achieve a better predictive quality than several baselines.
arXiv Detail & Related papers (2020-09-29T07:32:30Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.