On the Provable Advantage of Unsupervised Pretraining
- URL: http://arxiv.org/abs/2303.01566v1
- Date: Thu, 2 Mar 2023 20:42:05 GMT
- Title: On the Provable Advantage of Unsupervised Pretraining
- Authors: Jiawei Ge, Shange Tang, Jianqing Fan, Chi Jin
- Abstract summary: Unsupervised pretraining is a critical component of modern large-scale machine learning systems.
This paper studies a generic framework, where the unsupervised representation learning task is specified by an abstract class of latent variable models.
Under a mild "informative" condition, our algorithm achieves an excess risk of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$ for downstream tasks.
- Score: 26.065736182939222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised pretraining, which learns a useful representation using a large
amount of unlabeled data to facilitate the learning of downstream tasks, is a
critical component of modern large-scale machine learning systems. Despite its
tremendous empirical success, the rigorous theoretical understanding of why
unsupervised pretraining generally helps remains rather limited -- most
existing results are restricted to particular methods or approaches for
unsupervised pretraining with specialized structural assumptions. This paper
studies a generic framework, where the unsupervised representation learning
task is specified by an abstract class of latent variable models $\Phi$ and the
downstream task is specified by a class of prediction functions $\Psi$. We
consider a natural approach of using Maximum Likelihood Estimation (MLE) for
unsupervised pretraining and Empirical Risk Minimization (ERM) for learning
downstream tasks. We prove that, under a mild "informative" condition, our
algorithm achieves an excess risk of
$\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$
for downstream tasks, where $\mathcal{C}_\Phi, \mathcal{C}_\Psi$ are complexity
measures of function classes $\Phi, \Psi$, and $m, n$ are the number of
unlabeled and labeled data, respectively. Compared to the baseline of
$\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_{\Phi \circ \Psi}/n})$ achieved by
performing supervised learning using only the labeled data, our result
rigorously shows the benefit of unsupervised pretraining when $m \gg n$ and
$\mathcal{C}_{\Phi\circ \Psi} > \mathcal{C}_\Psi$. This paper further shows
that our generic framework covers a wide range of approaches for unsupervised
pretraining, including factor models, Gaussian mixture models, and contrastive
learning.
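The MLE-then-ERM pipeline described above can be sketched in a few lines, instantiated with a linear factor model (one of the examples the paper covers). The data-generating process, the dimensions, the use of a PCA subspace as the MLE representation, and least squares as the downstream ERM are illustrative assumptions made here, not the paper's exact construction.

```python
# Illustrative sketch: MLE pretraining on unlabeled data, then ERM on the
# learned representation with few labeled points. All choices below are
# placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, k, m, n = 20, 3, 5000, 50  # ambient dim, latent dim, unlabeled m, labeled n

# Latent factor model: x = B z + noise, with downstream label y = <w, z>.
B = rng.normal(size=(d, k))
w = rng.normal(size=k)

z_unlab = rng.normal(size=(m, k))
x_unlab = z_unlab @ B.T + 0.1 * rng.normal(size=(m, d))
z_lab = rng.normal(size=(n, k))
x_lab = z_lab @ B.T + 0.1 * rng.normal(size=(n, d))
y_lab = z_lab @ w

# Step 1 (pretraining): estimate the k-dim factor subspace from the m
# unlabeled points; for a Gaussian factor model the MLE subspace is spanned
# by the top principal components.
_, _, Vt = np.linalg.svd(x_unlab - x_unlab.mean(0), full_matrices=False)
B_hat = Vt[:k].T  # d x k representation map

# Step 2 (downstream ERM): least squares on the learned representation,
# using only the n labeled points.
phi_lab = x_lab @ B_hat
theta, *_ = np.linalg.lstsq(phi_lab, y_lab, rcond=None)
preds = phi_lab @ theta
print(np.mean((preds - y_lab) ** 2))  # downstream training MSE stays small
```

The point of the sketch is the division of labor: the complexity of the representation class is paid for with the $m$ unlabeled samples, while the labeled samples only need to fit the low-dimensional downstream predictor.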
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-trained models have achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- Blessing of Class Diversity in Pre-training [54.335530406959435]
We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
arXiv Detail & Related papers (2022-09-07T20:10:12Z)
- An Information-Theoretic Analysis for Transfer Learning: Error Bounds and Applications [5.081241420920605]
We give an information-theoretic analysis on the generalization error and excess risk of transfer learning algorithms.
Our results suggest, perhaps as expected, that the Kullback-Leibler divergence $D(\mu\|\mu')$ plays an important role in the characterizations.
Inspired by the derived bounds, we propose the InfoBoost algorithm in which the importance weights for source and target data are adjusted adaptively.
arXiv Detail & Related papers (2022-07-12T08:20:41Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
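The perturbation idea summarized above can be sketched in a few lines. The regression problem, feature map, noise scale, and ridge parameter below are placeholder choices for illustration, not the paper's prescribed values.

```python
# Minimal sketch of RLSVI-style randomized exploration: perturb the
# regression targets with i.i.d. scalar Gaussian noise before the
# least-squares fit, so the fitted values are randomly optimistic.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder problem standing in for one least-squares value fit.
features = rng.normal(size=(100, 5))     # state-action features
targets = features @ rng.normal(size=5)  # unperturbed regression targets

sigma = 0.1  # noise scale; chosen judiciously in the paper's analysis
noisy_targets = targets + sigma * rng.normal(size=targets.shape)

# Ridge-regularized least squares on the perturbed targets.
lam = 1.0
gram = features.T @ features + lam * np.eye(5)
theta = np.linalg.solve(gram, features.T @ noisy_targets)
print(theta.shape)
```

Repeating the fit with fresh noise draws yields a randomized ensemble of value estimates, which is what drives exploration in place of explicit bonus terms.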
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Learning to extrapolate using continued fractions: Predicting the critical temperature of superconductor materials [5.905364646955811]
In the field of Artificial Intelligence (AI) and Machine Learning (ML), the approximation of unknown target functions $y=f(\mathbf{x})$ is a common objective.
We refer to $S$ as the training set and aim to identify a low-complexity mathematical model that can effectively approximate this target function for new instances $\mathbf{x}$.
arXiv Detail & Related papers (2020-11-27T04:57:40Z)
- For self-supervised learning, Rationality implies generalization, provably [13.526562756159809]
We prove a new upper bound on the generalization gap of classifiers obtained by first using self-supervision.
We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-10-16T17:07:52Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
- Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only batch assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z)
- Weighted Empirical Risk Minimization: Sample Selection Bias Correction based on Importance Sampling [2.599882743586164]
We consider statistical learning problems in which the distribution $P'$ of the training observations $Z'_i$ differs from the distribution $P$ involved in the risk one seeks to minimize.
We show that, in various situations frequently encountered in practice, the weight function $\Phi(z)$ takes a simple form and can be directly estimated.
We then prove that the generalization capacity of the aforementioned approach is preserved when plugging the resulting estimates of the $\Phi(Z'_i)$'s into the weighted empirical risk.
arXiv Detail & Related papers (2020-02-12T18:42:47Z)
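The weighted ERM idea summarized above can be illustrated with a one-dimensional sketch in which the weight function $\Phi(z) = dP/dP'(z)$ is computed from known Gaussian densities; in the paper's setting it would be estimated from data. The distributions and the squared loss below are illustrative assumptions.

```python
# Sketch of weighted ERM: reweight training points by the likelihood
# ratio dP/dP' so the empirical risk targets the test distribution P.
import numpy as np

rng = np.random.default_rng(0)

# Training data drawn from P' = N(1, 1); the risk of interest is under P = N(0, 1).
z_train = rng.normal(loc=1.0, scale=1.0, size=500)

# Importance weights Phi(z) = dP/dP'(z); the normalizing constants cancel.
w = np.exp(-0.5 * z_train**2) / np.exp(-0.5 * (z_train - 1.0) ** 2)

# Minimizer of the weighted empirical squared loss is a weighted mean,
# which approximates the mean under P (near 0), whereas the unweighted
# training mean sits near 1.
theta_hat = np.sum(w * z_train) / np.sum(w)
print(round(float(theta_hat), 2))
```

The weighting corrects the sample selection bias: minimizing the unweighted empirical risk would target the training distribution $P'$ instead of $P$.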
This list is automatically generated from the titles and abstracts of the papers in this site.