MP-GELU Bayesian Neural Networks: Moment Propagation by GELU
Nonlinearity
- URL: http://arxiv.org/abs/2211.13402v1
- Date: Thu, 24 Nov 2022 03:37:29 GMT
- Title: MP-GELU Bayesian Neural Networks: Moment Propagation by GELU
Nonlinearity
- Authors: Yuki Hirayama, Sinya Takamaeda-Yamazaki
- Abstract summary: We propose a novel nonlinear function named moment propagating-Gaussian error linear unit (MP-GELU) that enables the fast derivation of first and second moments in BNNs.
MP-GELU provides higher prediction accuracy, better uncertainty quality, and faster execution than ReLU-based BNNs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bayesian neural networks (BNNs) have been an important framework in the study
of uncertainty quantification. Deterministic variational inference, one of the
inference methods, utilizes moment propagation to compute the predictive
distributions and objective functions. Unfortunately, deriving the moments
requires computationally expensive Taylor expansion in nonlinear functions,
such as a rectified linear unit (ReLU) or a sigmoid function. Therefore, a new
nonlinear function that realizes faster moment propagation than conventional
functions is required. In this paper, we propose a novel nonlinear function
named moment propagating-Gaussian error linear unit (MP-GELU) that enables the
fast derivation of first and second moments in BNNs. MP-GELU enables the
analytical computation of moments by applying nonlinearity to the input
statistics, thereby reducing the computationally expensive calculations
required for nonlinear functions. In empirical experiments on regression tasks,
we observed that the proposed MP-GELU provides higher prediction accuracy and
better quality of uncertainty with faster execution than those of ReLU-based
BNNs.
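To make the idea concrete, the sketch below contrasts the closed-form moments of a ReLU under a Gaussian input (the kind of per-activation moment computation the abstract calls conventional) with a hypothetical MP-GELU-style rule that applies a Gaussian gate directly to the input mean and variance, as the abstract describes. This is a minimal illustration only: the function names and the exact gating form are assumptions, not the paper's implementation.

```python
# Illustrative sketch of moment propagation through a nonlinearity.
# NOT the authors' code: the MP-GELU rule below is a guessed stand-in
# for "applying nonlinearity to the input statistics".
import numpy as np
from scipy.stats import norm

def relu_moments(mu, var):
    """Exact mean/variance of ReLU(X) for X ~ N(mu, var) (standard closed form)."""
    sigma = np.sqrt(var)
    z = mu / sigma
    m1 = mu * norm.cdf(z) + sigma * norm.pdf(z)                    # E[ReLU(X)]
    m2 = (mu**2 + var) * norm.cdf(z) + mu * sigma * norm.pdf(z)    # E[ReLU(X)^2]
    return m1, m2 - m1**2

def mp_gelu_moments(mu, var):
    """Hypothetical MP-GELU-style rule: gate the incoming mean and variance
    with a Gaussian CDF of the mean, avoiding per-activation expansions.
    The exact form used in the paper may differ."""
    gate = norm.cdf(mu / np.sqrt(1.0 + var))
    return mu * gate, var * gate**2

mu = np.array([0.5, -1.0])
var = np.array([0.2, 0.3])
print(relu_moments(mu, var))      # conventional closed-form propagation
print(mp_gelu_moments(mu, var))   # statistics-level gating (assumed form)
```

Both routines map an input (mean, variance) pair to an output (mean, variance) pair, which is the interface a moment-propagating BNN layer needs; the claimed advantage of MP-GELU is that this mapping stays cheap and analytical.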
Related papers
- Neural empirical interpolation method for nonlinear model reduction [0.0]
We introduce the neural empirical interpolation method (NEIM) for reducing the time complexity of computing the nonlinear term in a reduced order model (ROM).
NEIM is a greedy algorithm which accomplishes this reduction by approximating an affine decomposition of the nonlinear term of the ROM.
Because NEIM is based on a greedy strategy, we are able to provide a basic error analysis to investigate its performance.
arXiv Detail & Related papers (2024-06-05T18:17:33Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - Inference on Time Series Nonparametric Conditional Moment Restrictions
Using General Sieves [4.065100518793487]
This paper considers a general nonlinear sieve quasi-likelihood ratio (GN-QLR) for inference on expectation functionals of time series data.
While the normality of the estimated functionals depends on some unknown Riesz representer of the functional space, we show that the optimally weighted GN-QLR statistic is Chi-square distributed.
arXiv Detail & Related papers (2022-12-31T01:44:17Z) - Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - CDiNN -Convex Difference Neural Networks [0.8122270502556374]
Neural networks with the ReLU activation function have been shown to be universal function approximators that learn the function mapping as non-smooth functions.
A new neural network architecture, called ICNN, learns the output as a convex function of the input.
arXiv Detail & Related papers (2021-03-31T17:31:16Z) - Learning Recurrent Neural Net Models of Nonlinear Systems [10.5811404306981]
We find a continuous-time recurrent neural net with hyperbolic tangent activation function that approximately reproduces the underlying i/o behavior with high confidence.
We derive quantitative guarantees on the sup-norm risk of the learned model in terms of the number of neurons, the sample size, the number of derivatives being matched, and the regularity properties of the inputs, the outputs, and the unknown i/o map.
arXiv Detail & Related papers (2020-11-18T22:53:41Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)