Blessing of Class Diversity in Pre-training
- URL: http://arxiv.org/abs/2209.03447v1
- Date: Wed, 7 Sep 2022 20:10:12 GMT
- Title: Blessing of Class Diversity in Pre-training
- Authors: Yulai Zhao, Jianshu Chen, Simon S. Du
- Abstract summary: We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
- Score: 54.335530406959435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new statistical analysis aiming to explain the recent superior achievements of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show that the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in standard supervised learning. Here, $n$ is the number of pre-training samples and $m$ is the number of downstream samples, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.
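As a rough, illustrative sketch (not code from the paper), the snippet below computes the least singular value of a hypothetical last-layer weight matrix, which plays the role of the diversity measure $\tilde{\nu}$, and plugs made-up values of $n$ and $m$ into the two excess-risk rates quoted in the abstract:

```python
import numpy as np

# Illustrative sketch only: the matrix shape and sample sizes are hypothetical.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 1000))  # stand-in for a last linear layer (feature dim x num classes)
nu_tilde = np.linalg.svd(W, compute_uv=False).min()  # least singular value, playing the role of tilde{nu}

n = 10_000_000  # hypothetical number of pre-training samples
m = 1_000       # hypothetical number of downstream samples (n >> m)
transfer_rate = 1.0 / (nu_tilde * np.sqrt(n))   # O(1/(tilde{nu} * sqrt(n))) rate from the abstract
supervised_rate = 1.0 / np.sqrt(m)              # O(1/sqrt(m)) standard supervised baseline

print(f"tilde{{nu}} ~ {nu_tilde:.2f}")
print(f"transfer excess-risk rate   ~ {transfer_rate:.2e}")
print(f"supervised excess-risk rate ~ {supervised_rate:.2e}")
```

A larger least singular value (more diverse pre-training classes) and a larger pre-training set both shrink the transfer rate relative to the purely supervised $O\left(\frac{1}{\sqrt{m}}\right)$ rate.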
Related papers
- IT$^3$: Idempotent Test-Time Training [95.78053599609044]
This paper introduces Idempotent Test-Time Training (IT$^3$), a novel approach to addressing the challenge of distribution shift.
IT$^3$ is based on the universal property of idempotence.
We demonstrate the versatility of our approach across various tasks, including corrupted image classification.
arXiv Detail & Related papers (2024-10-05T15:39:51Z)
- Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [7.869708570399577]
We consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$.
Theoretically, we show that the trained Transformer reaches near-Bayes-optimal performance, suggesting that it makes use of information about the training distribution.
arXiv Detail & Related papers (2024-05-24T00:08:55Z)
- Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression [31.950737940558984]
Pretrained transformers exhibit the remarkable ability of in-context learning (ICL).
Can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining?
arXiv Detail & Related papers (2023-06-26T21:05:20Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when the pre-trained model is transferred to downstream tasks.
This paper considers pre-training a model that guarantees uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- On the Provable Advantage of Unsupervised Pretraining [26.065736182939222]
Unsupervised pretraining is a critical component of modern large-scale machine learning systems.
This paper studies a generic framework, where the unsupervised representation learning task is specified by an abstract class of latent variable models.
Under a mild "informative" condition, our algorithm achieves an excess risk of $\tilde{\mathcal{O}}\left(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n}\right)$ for downstream tasks.
arXiv Detail & Related papers (2023-03-02T20:42:05Z)
- Improving Representational Continuity via Continued Pretraining [76.29171039601948]
A standard transfer learning approach, linear probing followed by fine-tuning (LP-FT), outperforms naive training and other continual learning methods.
LP-FT also reduces forgetting on a real-world satellite remote sensing dataset (FMoW).
A variant of LP-FT achieves state-of-the-art accuracy on an NLP continual learning benchmark.
arXiv Detail & Related papers (2023-02-26T10:39:38Z)
- Mediated Uncoupled Learning: Learning Functions without Direct Input-output Correspondences [80.95776331769899]
We consider the task of predicting $Y$ from $X$ when no paired data of $(X, Y)$ are available; instead, we are given a dataset $S_X$ linking $X$ to a mediating variable $U$ and a dataset $S_Y$ linking $U$ to $Y$.
A naive approach is to predict $U$ from $X$ using $S_X$ and then $Y$ from $U$ using $S_Y$.
We propose a new method that avoids predicting $U$ but directly learns $Y = f(X)$ by training $f(X)$ with $S_X$ to predict $h(U)$.
arXiv Detail & Related papers (2021-07-16T22:13:29Z)
- Coresets for Classification -- Simplified and Strengthened [19.54307474041768]
We give relative error coresets for training linear classifiers with a broad class of loss functions.
Our construction achieves $\tilde{O}(d \cdot \mu_y(X)^2/\epsilon^2)$ points, where $\mu_y(X)$ is a natural complexity measure of the data matrix $X \in \mathbb{R}^{n \times d}$ and label vector $y \in \{-1,1\}^n$.
arXiv Detail & Related papers (2021-06-08T11:24:18Z)
- Learning to extrapolate using continued fractions: Predicting the critical temperature of superconductor materials [5.905364646955811]
In the field of Artificial Intelligence (AI) and Machine Learning (ML), the approximation of unknown target functions $y = f(\mathbf{x})$ is a common objective.
We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that can effectively approximate this target function for new instances $\mathbf{x}$.
arXiv Detail & Related papers (2020-11-27T04:57:40Z)
- On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\, C(\mathcal{F})$; a small numeric illustration of this scaling appears after this list.
arXiv Detail & Related papers (2020-06-20T20:33:59Z)
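The sample-complexity scaling in the last entry above lends itself to a small numeric illustration. In the sketch below, the complexity values for $\mathcal{H}$ and $\mathcal{F}$ are placeholders chosen only to make the comparison concrete; they are not quantities from that paper:

```python
# Placeholder complexities: a rich shared representation class H and simple task-specific heads F.
C_H = 10_000  # hypothetical complexity of the representation class H
C_F = 50      # hypothetical complexity of the task-specific class F

for t in (1, 10, 100, 1000):
    shared = C_H + t * C_F        # learn h once across t diverse tasks: C(H) + t*C(F)
    separate = t * (C_H + C_F)    # naive route: pay for H on every task
    print(f"t={t:5d}  shared={shared:9d}  separate={separate:9d}  savings={separate / shared:5.1f}x")
```

When the shared class $\mathcal{H}$ is much more complex than the task-specific class $\mathcal{F}$, the savings grow with the number of diverse training tasks $t$.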
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.