Transfer Learning for Structured Pruning under Limited Task Data
- URL: http://arxiv.org/abs/2311.06382v1
- Date: Fri, 10 Nov 2023 20:23:35 GMT
- Title: Transfer Learning for Structured Pruning under Limited Task Data
- Authors: Lucio Dery, David Grangier and Awni Hannun
- Abstract summary: We propose a framework which combines structured pruning with transfer learning to reduce the need for task-specific data.
We demonstrate that our framework results in pruned models with improved generalization over strong baselines.
- Score: 15.946734013984184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large, pre-trained models are problematic to use in resource-constrained
applications. Fortunately, task-aware structured pruning methods offer a
solution. These approaches reduce model size by dropping structural units like
layers and attention heads in a manner that takes into account the end-task.
However, these pruning algorithms require more task-specific data than is
typically available. We propose a framework which combines structured pruning
with transfer learning to reduce the need for task-specific data. Our empirical
results answer questions such as: How should the two tasks be coupled? What
parameters should be transferred? And, when during training should transfer
learning be introduced? Leveraging these insights, we demonstrate that our
framework results in pruned models with improved generalization over strong
baselines.
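As a rough illustration of the recipe the abstract describes, the sketch below jointly trains gated "structural units" on a small end task and a larger transfer task, while a sparsity penalty drives some gates toward zero so the corresponding units can be dropped. The toy encoder, loss weights, and pruning threshold are illustrative assumptions, not the authors' exact method.

```python
# Minimal PyTorch sketch: task-aware structured pruning via learnable gates,
# combined with a data-rich transfer task. All names and hyperparameters here
# are hypothetical; the paper's actual framework prunes transformer layers and
# attention heads with a more careful coupling of the two tasks.
import torch
import torch.nn as nn


class MaskedEncoder(nn.Module):
    """Toy encoder whose structural units (residual blocks) carry learnable gates."""

    def __init__(self, dim=32, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_blocks)])
        # One gate per block; gates driven toward zero mark blocks to drop.
        self.gates = nn.Parameter(torch.ones(n_blocks))

    def forward(self, x):
        for gate, block in zip(self.gates, self.blocks):
            x = x + gate * torch.relu(block(x))  # gated residual block
        return x


encoder = MaskedEncoder()
end_task_head = nn.Linear(32, 2)      # limited-data end task
transfer_head = nn.Linear(32, 5)      # data-rich transfer task

params = (list(encoder.parameters())
          + list(end_task_head.parameters())
          + list(transfer_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    # Synthetic batches stand in for the two tasks' data loaders.
    x_end, y_end = torch.randn(16, 32), torch.randint(0, 2, (16,))
    x_aux, y_aux = torch.randn(64, 32), torch.randint(0, 5, (64,))

    end_loss = loss_fn(end_task_head(encoder(x_end)), y_end)
    aux_loss = loss_fn(transfer_head(encoder(x_aux)), y_aux)
    sparsity = encoder.gates.abs().sum()  # L1 penalty pushes gates toward zero

    loss = end_loss + 0.5 * aux_loss + 0.01 * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Blocks whose gates fall below a threshold would be removed to shrink the model.
keep = encoder.gates.abs() > 0.1
print("blocks kept:", keep.tolist())
```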
Related papers
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation
Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the order of all multi-task data for training.
At the task level, we aim to find the optimal task order that minimizes the total cross-task interference risk.
At the instance level, we measure the difficulty of all instances per task, then divide them into easy-to-difficult mini-batches for training.
arXiv Detail & Related papers (2024-01-07T18:12:20Z) - Improving Long-Horizon Imitation Through Instruction Prediction [93.47416552953075]
In this work, we explore the use of an often unused source of auxiliary supervision: language.
Inspired by recent advances in transformer-based models, we train agents with an instruction prediction loss that encourages learning temporally extended representations that operate at a high level of abstraction.
In further analysis we find that instruction modeling is most important for tasks that require complex reasoning, while understandably offering smaller gains in environments that require simple plans.
arXiv Detail & Related papers (2023-06-21T20:47:23Z) - Optimal transfer protocol by incremental layer defrosting [66.76153955485584]
Transfer learning is a powerful tool enabling model training with limited amounts of data.
The simplest transfer learning protocol is based on "freezing" the feature-extractor layers of a network pre-trained on a data-rich source task.
We show that this protocol is often sub-optimal and that the largest performance gain may be achieved when smaller portions of the pre-trained network are kept frozen (a minimal freezing sketch appears after this list).
arXiv Detail & Related papers (2023-03-02T17:32:11Z) - Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for
Downstream Tasks [55.431048995662714]
We create a small model for a new task from the pruned models of similar tasks.
We show that a few fine-tuning steps on this model suffice to produce a promising pruned model for the new task.
We develop a simple but effective "Meta-Vote Pruning (MVP)" method that significantly reduces the pruning iterations for a new task.
arXiv Detail & Related papers (2023-01-27T06:49:47Z) - Unveiling Transformers with LEGO: a synthetic reasoning task [23.535488809197787]
We study how the transformer architecture learns to follow a chain of reasoning.
In some data regimes, the trained transformer finds "shortcut" solutions to follow the chain of reasoning.
We find that one can prevent such shortcuts with appropriate architecture modifications or careful data preparation.
arXiv Detail & Related papers (2022-06-09T06:30:17Z) - Fast Inference and Transfer of Compositional Task Structures for
Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z) - Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z) - Auxiliary Task Update Decomposition: The Good, The Bad and The Neutral [18.387162887917164]
We formulate a model-agnostic framework that performs fine-grained manipulation of the auxiliary task gradients.
We propose to decompose auxiliary updates into directions which help, damage or leave the primary task loss unchanged.
Our approach consistently outperforms strong and widely used baselines when leveraging out-of-distribution data for Text and Image classification tasks.
arXiv Detail & Related papers (2021-08-25T17:09:48Z) - Scalable Transfer Learning with Expert Models [32.48351077884257]
We explore the use of expert representations for transfer with a simple, yet effective, strategy.
We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task.
This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer.
arXiv Detail & Related papers (2020-09-28T12:07:10Z)
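The minimal freezing sketch referenced from the layer-defrosting entry above: it contrasts freezing the whole pre-trained feature extractor with keeping only a smaller portion frozen. The stand-in network, layer counts, and optimizer settings are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch of the "freeze the feature extractor" transfer protocol and of
# unfreezing a larger portion of the pre-trained network instead.
import torch
import torch.nn as nn


def build_pretrained_like():
    # Stand-in for a network pre-trained on a data-rich source task.
    return nn.Sequential(
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 10),  # task head, re-initialized for the target task
    )


def freeze_first_n(model: nn.Sequential, n_layers: int):
    """Freeze the first n_layers modules; the remaining modules stay trainable."""
    for i, module in enumerate(model):
        requires_grad = i >= n_layers
        for p in module.parameters():
            p.requires_grad = requires_grad


model = build_pretrained_like()

# Classic protocol: freeze the whole feature extractor (everything but the head).
freeze_first_n(model, n_layers=6)

# "Defrosted" variant: keep a smaller portion frozen and fine-tune more layers.
freeze_first_n(model, n_layers=2)

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.SGD(trainable, lr=1e-2)  # fine-tune only the unfrozen layers
print(sum(p.numel() for p in trainable), "trainable parameters")
```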