Bounds on Lp errors in density ratio estimation via f-divergence loss functions
- URL: http://arxiv.org/abs/2410.01516v2
- Date: Mon, 17 Mar 2025 03:01:25 GMT
- Title: Bounds on Lp errors in density ratio estimation via f-divergence loss functions
- Authors: Yoshiaki Kitazawa
- Abstract summary: Density ratio estimation (DRE) is a technique used to capture relationships between two probability distributions. $f$-divergence loss functions, which are derived from variational representations of $f$-divergence, have become a standard choice in DRE for achieving cutting-edge performance. This study provides novel theoretical insights into DRE by deriving upper and lower bounds on the $L_p$ errors through $f$-divergence loss functions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Density ratio estimation (DRE) is a core technique in machine learning used to capture relationships between two probability distributions. $f$-divergence loss functions, which are derived from variational representations of $f$-divergence, have become a standard choice in DRE for achieving cutting-edge performance. This study provides novel theoretical insights into DRE by deriving upper and lower bounds on the $L_p$ errors through $f$-divergence loss functions. These bounds apply to any estimator belonging to a class of Lipschitz continuous estimators, irrespective of the specific $f$-divergence loss function employed. The derived bounds are expressed as a product involving the data dimensionality and the expected value of the density ratio raised to the $p$-th power. Notably, the lower bound includes an exponential term that depends on the Kullback--Leibler (KL) divergence, revealing that the $L_p$ error increases significantly as the KL divergence grows when $p > 1$. This increase becomes even more pronounced as the value of $p$ grows. The theoretical insights are validated through numerical experiments.
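To make the setup concrete, here is a minimal sketch (not the paper's code) of neural density ratio estimation trained with an $f$-divergence loss, using the variational KL-divergence bound as the running example. The network name `RatioNet`, the Gaussian toy data, and the training hyperparameters are illustrative assumptions; the last lines compute an empirical $L_2$ error of the estimated ratio under $Q$, the kind of quantity the paper's bounds concern.

```python
# Illustrative sketch of DRE with an f-divergence loss (KL case); not the paper's code.
import torch
import torch.nn as nn

class RatioNet(nn.Module):
    """Small MLP producing T_theta(x); for the KL loss below, r_hat(x) = exp(T(x) - 1)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def kl_f_divergence_loss(t_p, t_q):
    # Negative variational lower bound for the KL divergence:
    #   D_KL(P || Q) >= E_P[T(x)] - E_Q[exp(T(x) - 1)],
    # where exp(t - 1) is the convex conjugate of f(u) = u * log(u).
    return -(t_p.mean() - torch.exp(t_q - 1.0).mean())

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 2                                   # data dimensionality (toy choice)
    model = RatioNet(d)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(2000):
        x_p = torch.randn(256, d) + 0.5     # samples from P = N(0.5 * 1, I)
        x_q = torch.randn(256, d)           # samples from Q = N(0, I)
        loss = kl_f_divergence_loss(model(x_p), model(x_q))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Empirical L_p error of the estimated ratio under Q (here p = 2).
    x = torch.randn(4096, d)
    r_hat = torch.exp(model(x) - 1.0)
    r_true = torch.exp(0.5 * (x.pow(2).sum(-1) - (x - 0.5).pow(2).sum(-1)))
    print((r_hat - r_true).abs().pow(2).mean().pow(0.5).item())
```

Swapping `kl_f_divergence_loss` for another $f$-divergence loss leaves the rest of the pipeline unchanged, which is why bounds that hold irrespective of the specific loss choice, as in the paper, are useful.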
Related papers
- Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling.
Next-token prediction can be made robust so as to achieve $C=\tilde{O}(H)$, representing moderate error amplification.
No computationally efficient algorithm can achieve a sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$.
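For context on the objective this entry refers to, a minimal sketch (assumed, not from the paper) of next-token prediction with the logarithmic loss follows; the function name `next_token_log_loss` and the toy horizon and vocabulary sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_log_loss(logits, targets):
    """Average negative log-probability assigned to each observed next token.

    logits: (H, vocab) unnormalized scores over the vocabulary at each position.
    targets: (H,) integer ids of the tokens actually observed.
    """
    return F.cross_entropy(logits, targets)  # mean over positions of -log softmax(logits)[target]

# Toy usage over a horizon of H = 8 steps and a 100-token vocabulary.
H, vocab = 8, 100
logits = torch.randn(H, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (H,))
next_token_log_loss(logits, targets).backward()  # gradient w.r.t. the logits, as in training
```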
arXiv Detail & Related papers (2025-02-18T02:52:00Z) - A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data [32.98264375121064]
We consider supervised deep learning when the given explanatory variable is distributed according to an exponential family.
Under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta + d)}{2\beta + d}}\, n^{-\frac{2\beta}{2\beta + d}}\right)$.
arXiv Detail & Related papers (2024-12-13T01:15:17Z) - A Unified Analysis for Finite Weight Averaging [50.75116992029417]
Averaging iterations of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, through schemes such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA).
In this paper, we generalize LAWA as Finite Weight Averaging (FWA) and explain their advantages compared to SGD from the perspective of optimization and generalization.
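Since the entry above describes averaging a finite window of recent SGD iterates, here is a minimal illustrative sketch of a LAWA/FWA-style sliding-window average in PyTorch. The helpers `average_state_dicts` and `train_with_fwa` are assumptions, not the paper's implementation, and the sketch assumes a plain model whose state_dict holds only floating-point parameters.

```python
from collections import deque
import copy

import torch

def average_state_dicts(state_dicts):
    """Elementwise mean of a list of model state_dicts (the averaged weights)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def train_with_fwa(model, optimizer, loss_fn, data_loader, k=5):
    """Run SGD, keep a sliding window of the last k iterates, and average at the end."""
    recent = deque(maxlen=k)                 # finite window of recent iterates
    for x, y in data_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        recent.append(copy.deepcopy(model.state_dict()))   # snapshot this iterate
    model.load_state_dict(average_state_dicts(list(recent)))  # finite weight average
    return model
```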
arXiv Detail & Related papers (2024-11-20T10:08:22Z) - Convergence Rate Analysis of LION [54.28350823319057]
LION converges at a rate of $\mathcal{O}(\sqrt{d}\,K^{-1/4})$ measured by a gradient-based Karush-Kuhn-Tucker (KKT) criterion, where $d$ is the problem dimension and $K$ is the number of iterations.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - Robust deep learning from weakly dependent data [0.0]
This paper considers robust deep learning from weakly dependent observations, with unbounded loss function and unbounded input/output.
We derive a relationship between these bounds and $r$, and when the data have moments of any order (that is, $r=\infty$), the convergence rate is close to some well-known results.
arXiv Detail & Related papers (2024-05-08T14:25:40Z) - Convergence Analysis of Probability Flow ODE for Score-based Generative Models [5.939858158928473]
We study the convergence properties of deterministic samplers based on probability flow ODEs from both theoretical and numerical perspectives.
We prove that the total variation distance between the target and the generated data distributions can be bounded above by $\mathcal{O}(d^{3/4}\delta^{1/2})$ at the continuous-time level.
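As the entry above concerns deterministic samplers built on probability flow ODEs, the following is an illustrative sketch (not the paper's construction) that Euler-integrates the probability flow ODE of a variance-preserving diffusion; the linear `beta` schedule, the closed-form Gaussian toy target, and the function names are assumptions made for the example.

```python
import numpy as np

def probability_flow_sample(score_fn, x_T, beta=lambda t: 0.1 + 19.9 * t,
                            T=1.0, n_steps=1000):
    """Euler integration of the VP probability flow ODE
        dx/dt = -0.5 * beta(t) * (x + score(x, t))
    backwards from t = T to t = 0, starting at x_T ~ N(0, I)."""
    x = np.array(x_T, dtype=float)
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt
        x = x - dt * (-0.5 * beta(t) * (x + score_fn(x, t)))  # x(t - dt) ~= x(t) - dx/dt * dt
    return x

# Toy check: data p_0 = N(2, 1). Under this VP schedule the marginal stays
# p_t = N(2 * sqrt(abar(t)), 1), so the exact score is available in closed form.
def exact_score(x, t):
    abar = np.exp(-(0.1 * t + 9.95 * t * t))      # exp(-integral_0^t beta(s) ds)
    return -(x - 2.0 * np.sqrt(abar))

samples = probability_flow_sample(exact_score, x_T=np.random.randn(2000))
print(samples.mean(), samples.std())              # close to (2, 1)
```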
arXiv Detail & Related papers (2024-04-15T12:29:28Z) - $α$-Divergence Loss Function for Neural Density Ratio Estimation [0.0]
Density ratio estimation (DRE) is a fundamental machine learning technique for capturing relationships between two probability distributions.
Existing methods face optimization challenges, such as overfitting caused by loss functions that are unbounded below, biased mini-batch gradients, vanishing training loss gradients, and high sample requirements for Kullback-Leibler (KL) divergence loss functions.
We propose a novel loss function for DRE, the $\alpha$-divergence loss function ($\alpha$-Div), which is concise but offers stable and effective optimization for DRE.
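As a companion to the KL sketch after the abstract above, here is one hedged parameterization of an $\alpha$-divergence-type DRE objective in which the network output $T(x)$ estimates $\log p(x)/q(x)$. This generic variational form, strictly convex in $T$ for $0 < \alpha < 1$, is an assumption for illustration and is not necessarily the exact $\alpha$-Div loss proposed in the paper.

```python
import torch

def alpha_dre_loss(t_p, t_q, alpha=0.5):
    """Generic alpha-type DRE objective (0 < alpha < 1); illustrative, not alpha-Div itself.

    t_p: network outputs T(x) on samples from P; t_q: outputs on samples from Q.
    Pointwise strictly convex in T, and its population minimizer satisfies
    T(x) = log p(x)/q(x), so the ratio is recovered as r_hat(x) = exp(T(x)).
    """
    assert 0.0 < alpha < 1.0
    return (torch.exp(alpha * t_q).mean() / alpha
            + torch.exp((alpha - 1.0) * t_p).mean() / (1.0 - alpha))
```

Dropping this loss into the training loop of the earlier sketch, with `r_hat = exp(T)` at evaluation time, is one way to compare how the choice of $f$-divergence loss affects the resulting $L_p$ error.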
arXiv Detail & Related papers (2024-02-03T05:33:01Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z) - $\alpha$-GAN: Convergence and Estimation Guarantees [7.493779672689531]
We prove a correspondence between the min-max optimization of general CPE loss function GANs and the minimization of associated $f$-divergences.
We then focus on $\alpha$-GAN, defined via the $\alpha$-loss, which interpolates several GANs and corresponds to the minimization of the Arimoto divergence.
arXiv Detail & Related papers (2022-05-12T23:26:51Z) - Faster Convergence of Local SGD for Over-Parameterized Models [1.5504102675587357]
Modern machine learning architectures are often highly expressive.
We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized functions in the heterogeneous data setting.
For general convex loss functions, we establish an error bound of $O(K/T)$; for non-convex loss functions, we prove an error bound of $O(K/T)$ in both cases.
We complete our results by providing problem instances in which our established convergence rates are tight up to a constant factor with a reasonably small stepsize.
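Because the entry above analyzes Local SGD (FedAvg) with $K$ local steps per round on heterogeneous data, here is a minimal illustrative sketch of one communication round; the quadratic per-worker losses and the helper name `local_sgd_round` are assumptions, not the paper's setting.

```python
import numpy as np

def local_sgd_round(w, worker_data, grad_fn, lr=0.1, K=10):
    """One FedAvg round: each worker runs K local SGD steps, then the server averages."""
    local_models = []
    for data in worker_data:                    # heterogeneous data across workers
        w_local = w.copy()
        for _ in range(K):                      # K local steps before communicating
            w_local = w_local - lr * grad_fn(w_local, data)
        local_models.append(w_local)
    return np.mean(local_models, axis=0)        # server-side parameter averaging

# Toy usage: worker m has loss 0.5 * ||w - c_m||^2, so its gradient is w - c_m.
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 3.0])]
w = np.zeros(2)
for _ in range(50):                             # communication rounds (T = 50 * K local steps)
    w = local_sgd_round(w, centers, lambda w, c: w - c)
print(w)                                        # approaches the minimizer, the mean of the centers
```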
arXiv Detail & Related papers (2022-01-30T04:05:56Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
Quantile regression tends to under-cover relative to the desired coverage level in practice.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
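To make the under-coverage claim concrete, the following is an assumed illustrative experiment (not the paper's): fit a linear conditional quantile by subgradient descent on the pinball loss and measure empirical coverage on fresh data. With $d$ not much smaller than $n$, the empirical coverage tends to fall below the nominal level $\tau$, consistent with the estimation-error explanation in the entry above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 200, 20, 0.9                       # modest sample size relative to dimension
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)        # true tau-quantile: X @ ones + Phi^{-1}(0.9)

# Subgradient descent on the pinball (quantile) loss for a linear model with intercept.
Xb = np.hstack([X, np.ones((n, 1))])
theta = np.zeros(d + 1)
for k in range(5000):
    q = Xb @ theta
    grad = Xb.T @ ((y < q).astype(float) - tau) / n   # subgradient of the mean pinball loss
    theta -= 0.5 / np.sqrt(k + 1) * grad

# Empirical coverage P(y <= fitted quantile) on fresh data.
X_test = rng.normal(size=(20000, d))
y_test = X_test @ np.ones(d) + rng.normal(size=20000)
q_test = np.hstack([X_test, np.ones((20000, 1))]) @ theta
print("nominal:", tau, "empirical:", float(np.mean(y_test <= q_test)))
```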
arXiv Detail & Related papers (2021-06-10T06:11:55Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - Rates of convergence for density estimation with generative adversarial networks [19.71040653379663]
We prove an oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate.
We show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log n/n)^{2\beta/(2\beta + d)}$.
arXiv Detail & Related papers (2021-01-30T09:59:14Z)