On the Provable Advantage of Unsupervised Pretraining
- URL: http://arxiv.org/abs/2303.01566v1
- Date: Thu, 2 Mar 2023 20:42:05 GMT
- Title: On the Provable Advantage of Unsupervised Pretraining
- Authors: Jiawei Ge, Shange Tang, Jianqing Fan, Chi Jin
- Abstract summary: Unsupervised pretraining is a critical component of modern large-scale machine learning systems.
This paper studies a generic framework, where the unsupervised representation learning task is specified by an abstract class of latent variable models.
Under a mild "informative" condition, our algorithm achieves an excess risk of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$ for downstream tasks.
- Score: 26.065736182939222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised pretraining, which learns a useful representation using a large
amount of unlabeled data to facilitate the learning of downstream tasks, is a
critical component of modern large-scale machine learning systems. Despite its
tremendous empirical success, the rigorous theoretical understanding of why
unsupervised pretraining generally helps remains rather limited -- most
existing results are restricted to particular methods or approaches for
unsupervised pretraining with specialized structural assumptions. This paper
studies a generic framework, where the unsupervised representation learning
task is specified by an abstract class of latent variable models $\Phi$ and the
downstream task is specified by a class of prediction functions $\Psi$. We
consider a natural approach of using Maximum Likelihood Estimation (MLE) for
unsupervised pretraining and Empirical Risk Minimization (ERM) for learning
downstream tasks. We prove that, under a mild "informative" condition, our
algorithm achieves an excess risk of
$\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$
for downstream tasks, where $\mathcal{C}_\Phi, \mathcal{C}_\Psi$ are complexity
measures of function classes $\Phi, \Psi$, and $m, n$ are the number of
unlabeled and labeled data, respectively. Compared to the baseline of
$\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_{\Phi \circ \Psi}/n})$ achieved by
performing supervised learning using only the labeled data, our result
rigorously shows the benefit of unsupervised pretraining when $m \gg n$ and
$\mathcal{C}_{\Phi\circ \Psi} > \mathcal{C}_\Psi$. This paper further shows
that our generic framework covers a wide range of approaches for unsupervised
pretraining, including factor models, Gaussian mixture models, and contrastive
learning.
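The MLE-then-ERM pipeline described above can be sketched in a few lines, instantiated with a linear factor model (one of the examples the paper covers). The data-generating process, the dimensions, the use of a PCA subspace as the MLE representation, and least squares as the downstream ERM are illustrative assumptions made here, not the paper's exact construction.

```python
# Illustrative sketch: MLE pretraining on unlabeled data, then ERM on the
# learned representation with few labeled points. All choices below are
# placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, k, m, n = 20, 3, 5000, 50  # ambient dim, latent dim, unlabeled m, labeled n

# Latent factor model: x = B z + noise, with downstream label y = <w, z>.
B = rng.normal(size=(d, k))
w = rng.normal(size=k)

z_unlab = rng.normal(size=(m, k))
x_unlab = z_unlab @ B.T + 0.1 * rng.normal(size=(m, d))
z_lab = rng.normal(size=(n, k))
x_lab = z_lab @ B.T + 0.1 * rng.normal(size=(n, d))
y_lab = z_lab @ w

# Step 1 (pretraining): estimate the k-dim factor subspace from the m
# unlabeled points; for a Gaussian factor model the MLE subspace is spanned
# by the top principal components.
_, _, Vt = np.linalg.svd(x_unlab - x_unlab.mean(0), full_matrices=False)
B_hat = Vt[:k].T  # d x k representation map

# Step 2 (downstream ERM): least squares on the learned representation,
# using only the n labeled points.
phi_lab = x_lab @ B_hat
theta, *_ = np.linalg.lstsq(phi_lab, y_lab, rcond=None)
preds = phi_lab @ theta
print(np.mean((preds - y_lab) ** 2))  # downstream training MSE stays small
```

The point of the sketch is the division of labor: the complexity of the representation class is paid for with the $m$ unlabeled samples, while the labeled samples only need to fit the low-dimensional downstream predictor.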
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-trained models have achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- Blessing of Class Diversity in Pre-training [54.335530406959435]
We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
arXiv Detail & Related papers (2022-09-07T20:10:12Z)
- An Information-Theoretic Analysis for Transfer Learning: Error Bounds and Applications [5.081241420920605]
We give an information-theoretic analysis on the generalization error and excess risk of transfer learning algorithms.
Our results suggest, perhaps as expected, that the Kullback-Leibler divergence $D(\mu\|\mu')$ plays an important role in the characterizations.
Inspired by the derived bounds, we propose the InfoBoost algorithm in which the importance weights for source and target data are adjusted adaptively.
arXiv Detail & Related papers (2022-07-12T08:20:41Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
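The perturbation idea summarized above can be sketched in a few lines. The regression problem, feature map, noise scale, and ridge parameter below are placeholder choices for illustration, not the paper's prescribed values.

```python
# Minimal sketch of RLSVI-style randomized exploration: perturb the
# regression targets with i.i.d. scalar Gaussian noise before the
# least-squares fit, so the fitted values are randomly optimistic.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder problem standing in for one least-squares value fit.
features = rng.normal(size=(100, 5))     # state-action features
targets = features @ rng.normal(size=5)  # unperturbed regression targets

sigma = 0.1  # noise scale; chosen judiciously in the paper's analysis
noisy_targets = targets + sigma * rng.normal(size=targets.shape)

# Ridge-regularized least squares on the perturbed targets.
lam = 1.0
gram = features.T @ features + lam * np.eye(5)
theta = np.linalg.solve(gram, features.T @ noisy_targets)
print(theta.shape)
```

Repeating the fit with fresh noise draws yields a randomized ensemble of value estimates, which is what drives exploration in place of explicit bonus terms.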
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Learning to extrapolate using continued fractions: Predicting the critical temperature of superconductor materials [5.905364646955811]
In the field of Artificial Intelligence (AI) and Machine Learning (ML), the approximation of unknown target functions $y=f(\mathbf{x})$ is a common objective.
We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that can effectively approximate this target function for new instances $\mathbf{x}$.
arXiv Detail & Related papers (2020-11-27T04:57:40Z)
- For self-supervised learning, Rationality implies generalization, provably [13.526562756159809]
We prove a new upper bound on the generalization gap of classifiers obtained by first using self-supervision.
We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-10-16T17:07:52Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
- Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)
- Weighted Empirical Risk Minimization: Sample Selection Bias Correction based on Importance Sampling [2.599882743586164]
We consider statistical learning problems in which the distribution $P'$ of the training observations $Z'_i$ differs from the distribution $P$ involved in the risk one seeks to minimize.
We show that, in various situations frequently encountered in practice, the weight function $\Phi(z)$ takes a simple form and can be directly estimated.
We then prove that the generalization capacity of the aforementioned approach is preserved when plugging the resulting estimates of the $\Phi(Z'_i)$'s into the weighted empirical risk.
arXiv Detail & Related papers (2020-02-12T18:42:47Z)
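The weighted ERM idea summarized above can be illustrated with a one-dimensional sketch in which the weight function $\Phi(z) = dP/dP'(z)$ is computed from known Gaussian densities; in the paper's setting it would be estimated from data. The distributions and the squared loss below are illustrative assumptions.

```python
# Sketch of weighted ERM: reweight training points by the likelihood
# ratio dP/dP' so the empirical risk targets the test distribution P.
import numpy as np

rng = np.random.default_rng(0)

# Training data drawn from P' = N(1, 1); the risk of interest is under P = N(0, 1).
z_train = rng.normal(loc=1.0, scale=1.0, size=500)

# Importance weights Phi(z) = dP/dP'(z); the normalizing constants cancel.
w = np.exp(-0.5 * z_train**2) / np.exp(-0.5 * (z_train - 1.0) ** 2)

# Minimizer of the weighted empirical squared loss is a weighted mean,
# which approximates the mean under P (near 0), whereas the unweighted
# training mean sits near 1.
theta_hat = np.sum(w * z_train) / np.sum(w)
print(round(float(theta_hat), 2))
```

The weighting corrects the sample selection bias: minimizing the unweighted empirical risk would target the training distribution $P'$ instead of $P$.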
This list is automatically generated from the titles and abstracts of the papers in this site.