Information-Theoretic Bounds on Transfer Generalization Gap Based on
Jensen-Shannon Divergence
- URL: http://arxiv.org/abs/2010.09484v4
- Date: Mon, 25 Jan 2021 04:49:57 GMT
- Title: Information-Theoretic Bounds on Transfer Generalization Gap Based on
Jensen-Shannon Divergence
- Authors: Sharu Theresa Jose, Osvaldo Simeone
- Abstract summary: In transfer learning, training and testing data sets are drawn from different data distributions.
This work presents novel information-theoretic upper bounds on the average transfer generalization gap.
- Score: 42.275148861039895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In transfer learning, training and testing data sets are drawn from different
data distributions. The transfer generalization gap is the difference between
the population loss on the target data distribution and the training loss. The
training data set generally includes data drawn from both source and target
distributions. This work presents novel information-theoretic upper bounds on
the average transfer generalization gap that capture $(i)$ the domain shift
between the target data distribution $P'_Z$ and the source distribution $P_Z$
through a two-parameter family of generalized
$(\alpha_1,\alpha_2)$-Jensen-Shannon (JS) divergences; and $(ii)$ the
sensitivity of the transfer learner output $W$ to each individual sample of the
data set $Z_i$ via the mutual information $I(W;Z_i)$. For $\alpha_1 \in (0,1)$,
the $(\alpha_1,\alpha_2)$-JS divergence can be bounded even when the support of
$P_Z$ is not included in that of $P'_Z$. This contrasts with the Kullback-Leibler
(KL) divergence $D_{KL}(P_Z \| P'_Z)$-based bounds of Wu et al. [1], which are
vacuous in this case. Moreover, the obtained bounds hold for unbounded
loss functions with bounded cumulant generating functions, unlike the
$\phi$-divergence based bound of Wu et al. [1]. We also obtain new upper bounds
on the average transfer excess risk in terms of the $(\alpha_1,\alpha_2)$-JS
divergence for empirical weighted risk minimization (EWRM), which minimizes the
weighted average training losses over source and target data sets. Finally, we
provide a numerical example to illustrate the merits of the introduced bounds.
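To make the support property above concrete, here is a minimal numerical sketch in Python. It uses a skewed-mixture form of the generalized JS divergence, $\alpha_2 D_{KL}(P \| M) + (1-\alpha_2) D_{KL}(Q \| M)$ with mixture $M = \alpha_1 P + (1-\alpha_1) Q$, as a stand-in for the paper's exact $(\alpha_1,\alpha_2)$-JS definition; the function names, parameter choices, and toy distributions below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; returns inf when supp(p) is not contained in supp(q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gen_js(p, q, a1=0.5, a2=0.5):
    """Illustrative (a1, a2)-skewed JS divergence (an assumption, not the paper's exact definition):
    a2 * D_KL(p || m) + (1 - a2) * D_KL(q || m), with mixture m = a1 * p + (1 - a1) * q.
    For a1 in (0, 1) the mixture dominates both p and q, so the value stays finite
    even when supp(p) is not contained in supp(q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = a1 * p + (1.0 - a1) * q
    return a2 * kl(p, m) + (1.0 - a2) * kl(q, m)

# Hypothetical toy example: the source P_Z puts mass on a symbol that the target P'_Z never emits,
# so supp(P_Z) is not contained in supp(P'_Z).
P_Z  = np.array([0.5, 0.3, 0.2])   # source distribution
P_Zp = np.array([0.6, 0.4, 0.0])   # target distribution (third symbol has zero mass)

print(kl(P_Z, P_Zp))                 # inf    -> KL-based bounds become vacuous
print(gen_js(P_Z, P_Zp, 0.5, 0.5))   # finite -> JS-based bounds remain informative
```

For $\alpha_1 \in (0,1)$ the mixture places positive mass wherever $P_Z$ does, so both KL terms stay finite even though $D_{KL}(P_Z \| P'_Z)$ itself is infinite; this is the mechanism that keeps the JS-based transfer bounds non-vacuous.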
Related papers
- Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction [57.93371273485736]
We consider a centralized distributed learning setup where all workers jointly find an unbiased estimator. We construct a lower bound of order $L\Delta\epsilon^{-2}$, better poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution.
arXiv Detail & Related papers (2025-06-30T13:27:39Z) - Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures [8.343111115184591]
Given finite-dimensional random vectors $Y$, $X$, and $Z$, we derive upper bounds on the excess minimum risk. The excess minimum risk is defined as the difference between the minimum expected loss in estimating $Y$ from $X$ and from $Z$. We present a family of bounds that generalize the mutual information-based bound of Györfi et al.
arXiv Detail & Related papers (2025-05-30T01:28:18Z) - Sum-of-squares lower bounds for Non-Gaussian Component Analysis [33.80749804695003]
Non-Gaussian Component Analysis (NGCA) is the statistical task of finding a non-Gaussian direction in a high-dimensional dataset.
Here we study the complexity of NGCA in the Sum-of-Squares framework.
arXiv Detail & Related papers (2024-10-28T18:19:13Z) - Gradual Domain Adaptation via Manifold-Constrained Distributionally Robust Optimization [0.4732176352681218]
This paper addresses the challenge of gradual domain adaptation within a class of manifold-constrained data distributions.
We propose a methodology rooted in Distributionally Robust Optimization (DRO) with an adaptive Wasserstein radius.
Our bounds rely on a newly introduced compatibility measure, which fully characterizes the error propagation dynamics along the sequence.
arXiv Detail & Related papers (2024-10-17T22:07:25Z) - Distribution-Aware Mean Estimation under User-level Local Differential Privacy [5.267844649650687]
We consider the problem of mean estimation under user-level local differential privacy, where $n$ users contribute through their local pool of data samples.
Based on a distribution-aware mean estimation algorithm, we establish $M$-dependent upper bounds on the worst-case risk over $\mu$ for the task of mean estimation.
arXiv Detail & Related papers (2024-10-12T11:57:52Z) - Bounds on $L_p$ Errors in Density Ratio Estimation via $f$-Divergence Loss Functions [0.0]
Density ratio estimation (DRE) is a fundamental machine learning technique for identifying relationships between two probability distributions.
$f$-divergence loss functions, derived from variational representations of $f$-divergence, are commonly employed in DRE to achieve state-of-the-art results.
This study presents a novel perspective on DRE using $f$-divergence loss functions by deriving the upper and lower bounds on $L_p$ errors.
arXiv Detail & Related papers (2024-10-02T13:05:09Z) - Statistical Error Bounds for GANs with Nonlinear Objective Functionals [5.022028859839544]
We derive statistical error bounds for $(f,\Gamma)$-GANs for general classes of $f$ and $\Gamma$ in the form of finite-sample concentration inequalities.
The results prove the statistical consistency of $(f,\Gamma)$-GANs and reduce to the known results for IPM-GANs in the appropriate limit.
arXiv Detail & Related papers (2024-06-24T17:42:03Z) - Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces [24.03281329962804]
In this paper, we focus on the non-asymptotic statistical rates of distributional temporal difference learning.
We show that NTD with a generative model needs $\tilde{O}\big(\varepsilon^{-2}\mu_{\pi,\min}^{-1}(1-\gamma)^{-3} + t_{\mathrm{mix}}\,\mu_{\pi,\min}^{-1}(1-\gamma)^{-1}\big)$ samples in the case of the $1$-Wasserstein distance.
We also establish a novel Freedman's inequality in Hilbert spaces.
arXiv Detail & Related papers (2024-03-09T06:19:53Z) - Data Structures for Density Estimation [66.36971978162461]
Given a sublinear (in $n$) number of samples from $p$, our main result is the first data structure that identifies $v_i$ in time sublinear in $k$.
We also give an improved version of the algorithm of Acharya et al. that reports $v_i$ in time linear in $k$.
arXiv Detail & Related papers (2023-06-20T06:13:56Z) - How Does Pseudo-Labeling Affect the Generalization Error of the
Semi-Supervised Gibbs Algorithm? [73.80001705134147]
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm.
The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset.
arXiv Detail & Related papers (2022-10-15T04:11:56Z) - Learning Algorithm Generalization Error Bounds via Auxiliary Distributions [16.44492672878356]
Generalization error bounds are essential for understanding how well machine learning models generalize.
In this work, we propose a novel method, the Auxiliary Distribution Method, which leads to new upper bounds on expected generalization errors.
arXiv Detail & Related papers (2022-10-02T10:37:04Z) - The Power and Limitation of Pretraining-Finetuning for Linear Regression
under Covariate Shift [127.21287240963859]
We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data.
For a large class of linear regression instances, transfer learning with $O(N^2)$ source data is as effective as supervised learning with $N$ target data.
arXiv Detail & Related papers (2022-08-03T05:59:49Z) - Non-Gaussian Component Analysis via Lattice Basis Reduction [56.98280399449707]
Non-Gaussian Component Analysis (NGCA) is a distribution learning problem.
We provide an efficient algorithm for NGCA in the regime that $A$ is discrete or nearly discrete.
arXiv Detail & Related papers (2021-12-16T18:38:02Z) - A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian
Kernel, a Precise Phase Transition, and the Corresponding Double Descent [85.77233010209368]
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples $n$, their dimension $p$, and the number of random features $N$ are all large and comparable.
This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$.
arXiv Detail & Related papers (2020-06-09T02:05:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.