A Universal Approximation Theorem of Deep Neural Networks for Expressing
Probability Distributions
- URL: http://arxiv.org/abs/2004.08867v3
- Date: Mon, 16 Nov 2020 04:29:47 GMT
- Title: A Universal Approximation Theorem of Deep Neural Networks for Expressing
Probability Distributions
- Authors: Yulong Lu and Jianfeng Lu
- Abstract summary: We prove that there exists a deep neural network $g:\mathbb{R}^d\rightarrow \mathbb{R}$ with ReLU activation.
The size of the neural network can grow exponentially in $d$ when the $1$-Wasserstein distance is used as the discrepancy.
- Score: 12.100913944042972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the universal approximation property of deep neural
networks for representing probability distributions. Given a target
distribution $\pi$ and a source distribution $p_z$ both defined on
$\mathbb{R}^d$, we prove under some assumptions that there exists a deep neural
network $g:\mathbb{R}^d\rightarrow \mathbb{R}$ with ReLU activation such that
the push-forward measure $(\nabla g)_\# p_z$ of $p_z$ under the map $\nabla g$
is arbitrarily close to the target measure $\pi$. The closeness is measured by
three classes of integral probability metrics between probability
distributions: the $1$-Wasserstein distance, maximum mean discrepancy (MMD), and
kernelized Stein discrepancy (KSD). We prove upper bounds for the size (width
and depth) of the deep neural network in terms of the dimension $d$ and the
approximation error $\varepsilon$ with respect to the three discrepancies. In
particular, the size of the neural network can grow exponentially in $d$ when
the $1$-Wasserstein distance is used as the discrepancy, whereas for both MMD and
KSD the size of the neural network depends on $d$ at most polynomially. Our
proof relies on convergence estimates of empirical measures under the
aforementioned discrepancies and on semi-discrete optimal transport.
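As a rough numerical companion to the abstract (an illustration only, not the paper's construction or proof), the sketch below pushes samples from a Gaussian source $p_z$ through the gradient map $\nabla g$ of a small, untrained ReLU network $g:\mathbb{R}^d\rightarrow\mathbb{R}$, and measures the discrepancy between the push-forward $(\nabla g)_\# p_z$ and a stand-in target with a Gaussian-kernel MMD estimate. For a ReLU network $\nabla g$ is defined almost everywhere and is piecewise constant; the sketch simply evaluates it with autograd. The architecture, kernel bandwidth, and target distribution are arbitrary, hypothetical choices.

```python
# Minimal sketch: push samples from a source p_z through the gradient map of a
# small ReLU network g: R^d -> R, then estimate the squared MMD to a target sample.
# The network, bandwidth, and target below are illustrative choices only.
import torch

d = 2
g = torch.nn.Sequential(               # g: R^d -> R with ReLU activation
    torch.nn.Linear(d, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def grad_g(z):
    """Evaluate the (a.e. defined) gradient map grad g at a batch of points z."""
    z = z.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(g(z).sum(), z)
    return grad

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD with a Gaussian kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

z = torch.randn(1000, d)                     # source p_z: standard Gaussian
target = torch.randn(1000, d) * 0.5 + 2.0    # stand-in target pi (illustrative)
pushed = grad_g(z)                           # samples from (grad g)_# p_z
print("MMD^2 between (grad g)_# p_z and target:", mmd2(pushed, target).item())
```

Training $g$ to reduce this MMD estimate would mimic the approximation statement; no such optimization is attempted here.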
Related papers
- New advances in universal approximation with neural networks of minimal width [4.424170214926035]
We show that autoencoders with leaky ReLU activations are universal approximators of $L^p$ functions.
We broaden our results to show that smooth invertible neural networks can approximate $L^p(\mathbb{R}^d,\mathbb{R}^d)$ on compacta.
arXiv Detail & Related papers (2024-11-13T16:17:16Z)
- Relative-Translation Invariant Wasserstein Distance [82.6068808353647]
We introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$).
We show that $RW_p$ distances are also real distance metrics defined on the quotient set $\mathcal{P}_p(\mathbb{R}^n)/\sim$ invariant to distribution translations.
arXiv Detail & Related papers (2024-09-04T03:41:44Z)
- A Theory of Interpretable Approximations [61.90216959710842]
We study the idea of approximating a target concept $c$ by a small aggregation of concepts from some base class $\mathcal{H}$.
For any given pair of $\mathcal{H}$ and $c$, exactly one of these cases holds: (i) $c$ cannot be approximated by $\mathcal{H}$ with arbitrary accuracy.
We show that, in the case of interpretable approximations, even a slightly nontrivial a-priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-
arXiv Detail & Related papers (2024-06-15T06:43:45Z)
- Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z)
- Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds [75.51968172401394]
We study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNNs).
In each iteration of NPMD, both the value function and the policy can be well approximated by CNNs.
We show that NPMD can leverage the low-dimensional structure of state space to escape from the curse of dimensionality.
arXiv Detail & Related papers (2023-09-25T07:31:22Z)
- Quantitative CLTs in Deep Neural Networks [12.845031126178593]
We study the output distribution of a fully connected neural network with random Gaussian weights and biases.
We obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth.
Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature.
arXiv Detail & Related papers (2023-07-12T11:35:37Z)
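As a quick empirical companion to the Quantitative CLTs entry above (a simulation only, not the paper's quantitative bounds), the sketch below samples many independent fully connected ReLU networks with Gaussian weights and biases, evaluates each at one fixed input, and compares the resulting output distribution to a fitted Gaussian. The width, depth, input, sample count, and normality check are illustrative choices.

```python
# Simulation sketch: outputs of random Gaussian-initialized ReLU networks at a
# fixed input, compared against a fitted Gaussian via a Kolmogorov-Smirnov distance.
# All sizes below are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d_in, width, depth, n_nets = 10, 256, 3, 5000
x = rng.standard_normal(d_in)

def random_relu_net_output(rng, x, width, depth):
    """One forward pass through a freshly sampled Gaussian-initialized ReLU net."""
    h, fan_in = x, x.shape[0]
    for _ in range(depth):
        W = rng.standard_normal((width, fan_in)) / np.sqrt(fan_in)
        b = rng.standard_normal(width) * 0.1
        h = np.maximum(W @ h + b, 0.0)
        fan_in = width
    w_out = rng.standard_normal(width) / np.sqrt(width)
    return w_out @ h

outputs = np.array([random_relu_net_output(rng, x, width, depth) for _ in range(n_nets)])
ks_stat, _ = stats.kstest(outputs, "norm", args=(outputs.mean(), outputs.std()))
print(f"KS distance to a fitted Gaussian at width {width}: {ks_stat:.4f}")
```

Repeating the experiment at several widths would show the distance shrinking as the width grows, in the spirit (though not with the rates) of the quantitative bounds.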
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- Overparametrized linear dimensionality reductions: From projection pursuit to two-layer neural networks [10.368585938419619]
Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$.
What does this collection of probability distributions look like when $n,d$ grow large?
Denoting by $\mathscr{F}_{m,\alpha}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $\mathscr{F}_{m,\alpha}$.
arXiv Detail & Related papers (2022-06-14T00:07:33Z)
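As a small simulation in the spirit of the projection-pursuit entry above (an illustration of the question, not of the paper's bounds), the sketch below projects one fixed cloud of $n$ points in $\mathbb{R}^d$ onto many random one-dimensional directions and checks how close the projected empirical distributions are to a Gaussian; the paper studies the full set of $m$-dimensional projections, while $m=1$, the data model, and the normality check here are illustrative choices.

```python
# Simulation sketch: empirical distributions of random 1-D projections of a
# fixed, deliberately non-Gaussian point cloud. Sizes are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, n_projections = 2000, 500, 200
X = rng.standard_normal((n, d)) ** 3            # heavy-tailed, non-Gaussian coordinates

ks_distances = []
for _ in range(n_projections):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                      # random unit direction in R^d
    proj = X @ u                                # 1-D projection of every data point
    ks, _ = stats.kstest(proj, "norm", args=(proj.mean(), proj.std()))
    ks_distances.append(ks)

# Most random directions look nearly Gaussian even though the cloud itself is not,
# which is the classical backdrop for characterizing the attainable projections.
print(f"median KS distance to a fitted Gaussian over {n_projections} directions:"
      f" {np.median(ks_distances):.4f}")
```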
- A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z)
- Fundamental tradeoffs between memorization and robustness in random features and neural tangent regimes [15.76663241036412]
We prove for a large class of activation functions that, if the model memorizes even a fraction of the training data, then its Sobolev seminorm is lower-bounded.
Experiments reveal, for the first time, a multiple-descent phenomenon in the robustness of the min-norm interpolator.
arXiv Detail & Related papers (2021-06-04T17:52:50Z)
- Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient [9.309655246559094]
We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.
Our analysis relies on connecting the pruning of random ReLU networks to random instances of the \textsc{SubsetSum} problem; a toy illustration follows this list.
arXiv Detail & Related papers (2020-06-14T19:32:10Z)
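To make the SubsetSum connection referenced above concrete (a toy illustration only, not the paper's construction), the sketch below approximates a single target weight by the sum of a chosen subset of random values, which is the elementary step behind pruning-based approximation guarantees. The brute-force search and the sample sizes are illustrative choices.

```python
# Toy sketch of the SubsetSum idea behind lottery-ticket pruning guarantees:
# approximate one target weight by the sum of a subset of random candidate weights.
# Brute-force search over all subsets; sizes are tiny and purely illustrative.
from itertools import combinations
import random

random.seed(0)
target = 0.37                                              # a single target network weight
candidates = [random.uniform(-1, 1) for _ in range(12)]    # over-provisioned random weights

best_subset, best_err = (), abs(target)
for r in range(1, len(candidates) + 1):
    for subset in combinations(range(len(candidates)), r):
        err = abs(target - sum(candidates[i] for i in subset))
        if err < best_err:
            best_subset, best_err = subset, err

print(f"kept {len(best_subset)} of {len(candidates)} random weights, "
      f"approximation error {best_err:.5f}")
```

With only logarithmically many candidates per target weight, such subset sums already achieve small error with high probability, which is the intuition behind the logarithmic over-parameterization bound.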
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.