MP-GELU Bayesian Neural Networks: Moment Propagation by GELU
Nonlinearity
- URL: http://arxiv.org/abs/2211.13402v1
- Date: Thu, 24 Nov 2022 03:37:29 GMT
- Title: MP-GELU Bayesian Neural Networks: Moment Propagation by GELU
Nonlinearity
- Authors: Yuki Hirayama, Sinya Takamaeda-Yamazaki
- Abstract summary: We propose a novel nonlinear function named moment propagating-Gaussian error linear unit (MP-GELU) that enables the fast derivation of first and second moments in BNNs.
MP-GELU provides higher prediction accuracy, better uncertainty quality, and faster execution than ReLU-based BNNs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bayesian neural networks (BNNs) have been an important framework in the study
of uncertainty quantification. Deterministic variational inference, one of the
inference methods, utilizes moment propagation to compute the predictive
distributions and objective functions. Unfortunately, deriving the moments
requires computationally expensive Taylor expansion in nonlinear functions,
such as a rectified linear unit (ReLU) or a sigmoid function. Therefore, a new
nonlinear function that realizes faster moment propagation than conventional
functions is required. In this paper, we propose a novel nonlinear function
named moment propagating-Gaussian error linear unit (MP-GELU) that enables the
fast derivation of first and second moments in BNNs. MP-GELU enables the
analytical computation of moments by applying nonlinearity to the input
statistics, thereby reducing the computationally expensive calculations
required for nonlinear functions. In empirical experiments on regression tasks,
we observed that the proposed MP-GELU provides higher prediction accuracy and
better quality of uncertainty with faster execution than those of ReLU-based
BNNs.
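To make the idea concrete, the sketch below contrasts the closed-form moments of a ReLU under a Gaussian input (the kind of per-activation moment computation the abstract calls conventional) with a hypothetical MP-GELU-style rule that applies a Gaussian gate directly to the input mean and variance, as the abstract describes. This is a minimal illustration only: the function names and the exact gating form are assumptions, not the paper's implementation.

```python
# Illustrative sketch of moment propagation through a nonlinearity.
# NOT the authors' code: the MP-GELU rule below is a guessed stand-in
# for "applying nonlinearity to the input statistics".
import numpy as np
from scipy.stats import norm

def relu_moments(mu, var):
    """Exact mean/variance of ReLU(X) for X ~ N(mu, var) (standard closed form)."""
    sigma = np.sqrt(var)
    z = mu / sigma
    m1 = mu * norm.cdf(z) + sigma * norm.pdf(z)                    # E[ReLU(X)]
    m2 = (mu**2 + var) * norm.cdf(z) + mu * sigma * norm.pdf(z)    # E[ReLU(X)^2]
    return m1, m2 - m1**2

def mp_gelu_moments(mu, var):
    """Hypothetical MP-GELU-style rule: gate the incoming mean and variance
    with a Gaussian CDF of the mean, avoiding per-activation expansions.
    The exact form used in the paper may differ."""
    gate = norm.cdf(mu / np.sqrt(1.0 + var))
    return mu * gate, var * gate**2

mu = np.array([0.5, -1.0])
var = np.array([0.2, 0.3])
print(relu_moments(mu, var))      # conventional closed-form propagation
print(mp_gelu_moments(mu, var))   # statistics-level gating (assumed form)
```

Both routines map an input (mean, variance) pair to an output (mean, variance) pair, which is the interface a moment-propagating BNN layer needs; the claimed advantage of MP-GELU is that this mapping stays cheap and analytical.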
Related papers
- Neural empirical interpolation method for nonlinear model reduction [0.0]
We introduce the neural empirical interpolation method (NEIM) for reducing the time complexity of computing the nonlinear term in a reduced order model (ROM).
NEIM is a greedy algorithm which accomplishes this reduction by approximating an affine decomposition of the nonlinear term of the ROM.
Because NEIM is based on a greedy strategy, we are able to provide a basic error analysis to investigate its performance.
arXiv Detail & Related papers (2024-06-05T18:17:33Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - Inference on Time Series Nonparametric Conditional Moment Restrictions
Using General Sieves [4.065100518793487]
This paper considers a general nonlinear sieve quasi-likelihood ratio (GN-QLR) for inference on expectation functionals of time series data.
While the normality of the estimated functionals depends on some unknown Riesz representer of the functional space, we show that the optimally weighted GN-QLR statistic is Chi-square distributed.
arXiv Detail & Related papers (2022-12-31T01:44:17Z) - Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - CDiNN -Convex Difference Neural Networks [0.8122270502556374]
Neural networks with the ReLU activation function have been shown to be universal function approximators that learn the function mapping as non-smooth functions.
A new neural network architecture, called ICNN, learns the output as a convex function of the input.
arXiv Detail & Related papers (2021-03-31T17:31:16Z) - Learning Recurrent Neural Net Models of Nonlinear Systems [10.5811404306981]
We find a continuous-time recurrent neural net with hyperbolic tangent activation function that approximately reproduces the underlying i/o behavior with high confidence.
We derive quantitative guarantees on the sup-norm risk of the learned model in terms of the number of neurons, the sample size, the number of derivatives being matched, and the regularity properties of the inputs, the outputs, and the unknown i/o map.
arXiv Detail & Related papers (2020-11-18T22:53:41Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)