Task-Robust Pre-Training for Worst-Case Downstream Adaptation
- URL: http://arxiv.org/abs/2306.12070v3
- Date: Fri, 24 Nov 2023 07:00:54 GMT
- Title: Task-Robust Pre-Training for Worst-Case Downstream Adaptation
- Authors: Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin
- Abstract summary: Pre-training has achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees uniformly good performance over the downstream tasks.
- Score: 62.05108162160981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training has achieved remarkable success when transferred to
downstream tasks. In machine learning, we care not only about a model's good
performance but also about its behavior under reasonable shifts in conditions.
The same philosophy holds when pre-training a foundation model. However, the
foundation model may not behave uniformly well across a series of related
downstream tasks. This can happen, for example, in masked-recovery regression
when the recovery ability or the training instances diverge: pattern features
are dominantly extracted during pre-training, while semantic features are also
required by a downstream task. This paper considers pre-training a model that
guarantees uniformly good performance over the downstream tasks. We call this
goal $\textit{downstream-task robustness}$. Our method first separates the
upstream task into several representative ones and applies a simple minimax
loss for pre-training. We then design an efficient algorithm to solve the
minimax loss and prove its convergence in the convex setting. In experiments
on large-scale natural language processing and computer vision datasets, we
show that our method improves the metrics on worst-case downstream tasks.
Additionally, we provide theoretical explanations for why our loss is
beneficial; in particular, we show that in some cases fewer samples are
inherently required for the most challenging downstream task.
Related papers
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
- Less is More: On the Feature Redundancy of Pretrained Models When Transferring to Few-shot Tasks [120.23328563831704]
Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data.
We show that, for linear probing, the pretrained features can be extremely redundant when the downstream data is scarce.
arXiv Detail & Related papers (2023-10-05T19:00:49Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for Downstream Tasks [55.431048995662714]
We create a small model for a new task from the pruned models of similar tasks.
We show that a few fine-tuning steps on this model suffice to produce a promising pruned model for the new task.
We develop a simple but effective "Meta-Vote Pruning (MVP)" method that significantly reduces the pruning iterations for a new task.
arXiv Detail & Related papers (2023-01-27T06:49:47Z)
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z)
- On Transfer of Adversarial Robustness from Pretraining to Downstream Tasks [1.8900691517352295]
We show that the robustness of a linear predictor on downstream tasks can be constrained by the robustness of its underlying representation.
Our results offer an initial step towards characterizing the requirements of the representation function for reliable post-adaptation performance.
arXiv Detail & Related papers (2022-08-07T23:00:40Z)
- Rethinking supervised pre-training for better downstream transferring [46.09030708111374]
We propose a new supervised pre-training method based on Leave-One-Out K-Nearest-Neighbor, or LOOK.
It relieves the problem of overfitting to upstream tasks by only requiring each image to share its class label with most of its k nearest neighbors.
We developed an efficient implementation of the proposed method that scales well to large datasets.
arXiv Detail & Related papers (2021-10-12T13:57:38Z)
- When does loss-based prioritization fail? [18.982933391138268]
We show that loss-based acceleration methods degrade in scenarios with noisy and corrupted data.
Measures of example difficulty need to correctly separate out noise from other types of challenging examples.
arXiv Detail & Related papers (2021-07-16T07:23:15Z)
- LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
arXiv Detail & Related papers (2021-02-22T13:58:11Z)