Scalable First-Order Bayesian Optimization via Structured Automatic
Differentiation
- URL: http://arxiv.org/abs/2206.08366v1
- Date: Thu, 16 Jun 2022 17:59:48 GMT
- Title: Scalable First-Order Bayesian Optimization via Structured Automatic
Differentiation
- Authors: Sebastian Ament and Carla Gomes
- Abstract summary: We show that a wide range of kernels gives rise to structured matrices, enabling an exact $mathcalO(n2d)$ matrix-vector multiply for gradient observations and $mathcalO(n2d2)$ for Hessian observations.
Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels.
- Score: 4.061135251278187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bayesian Optimization (BO) has shown great promise for the global
optimization of functions that are expensive to evaluate, but despite many
successes, standard approaches can struggle in high dimensions. To improve the
performance of BO, prior work suggested incorporating gradient information into
a Gaussian process surrogate of the objective, giving rise to kernel matrices
of size $nd \times nd$ for $n$ observations in $d$ dimensions. Na\"ively
multiplying with (resp. inverting) these matrices requires
$\mathcal{O}(n^2d^2)$ (resp. $\mathcal{O}(n^3d^3$)) operations, which becomes
infeasible for moderate dimensions and sample sizes. Here, we observe that a
wide range of kernels gives rise to structured matrices, enabling an exact
$\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and
$\mathcal{O}(n^2d^2)$ for Hessian observations. Beyond canonical kernel
classes, we derive a programmatic approach to leveraging this type of structure
for transformations and combinations of the discussed kernel classes, which
constitutes a structure-aware automatic differentiation algorithm. Our methods
apply to virtually all canonical kernels and automatically extend to complex
kernels, like the neural network, radial basis function network, and spectral
mixture kernels without any additional derivations, enabling flexible,
problem-dependent modeling while scaling first-order BO to high $d$.
Related papers
- Optimization without Retraction on the Random Generalized Stiefel Manifold [9.301728976515255]
We propose a cheap iterative method that solves the optimization problem while having access only to a random estimates of $B$.
Our method does not enforce the constraint in every iteration; instead, it produces iterations that converge to critical points on the generalized Stiefel manifold defined in expectation.
arXiv Detail & Related papers (2024-05-02T19:55:30Z) - Low-complexity subspace-descent over symmetric positive definite
manifold [9.346050098365648]
We develop low-complexity algorithms for the minimization of functions over the symmetric positive definite (SPD) manifold.
The proposed approach utilizes carefully chosen subspaces that allow the update to be written as a product of the Cholesky factor of the iterate and a sparse matrix.
arXiv Detail & Related papers (2023-05-03T11:11:46Z) - Efficient Convex Algorithms for Universal Kernel Learning [50.877957471649395]
An ideal set of kernels should: admit a linear parameterization (for tractability); dense in the set of all kernels (for accuracy)
Previous algorithms for optimization of kernels were limited to classification and relied on computationally complex Semidefinite Programming (SDP) algorithms.
We propose a SVD-QCQPQP algorithm which dramatically reduces the computational complexity as compared with previous SDP-based approaches.
arXiv Detail & Related papers (2023-04-15T04:57:37Z) - Data-Driven Linear Complexity Low-Rank Approximation of General Kernel
Matrices: A Geometric Approach [0.9453554184019107]
A kernel matrix may be defined as $K_ij = kappa(x_i,y_j)$ where $kappa(x,y)$ is a kernel function.
We seek a low-rank approximation to a kernel matrix where the sets of points $X$ and $Y$ are large.
arXiv Detail & Related papers (2022-12-24T07:15:00Z) - Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process
Regression with Mat\'ern Correlations [23.560067934682294]
We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Mat'ern correlations.
The proposed algorithm is significantly superior to the existing alternatives in both the computational time and predictive accuracy.
arXiv Detail & Related papers (2022-03-07T03:30:35Z) - Geometry-aware Bayesian Optimization in Robotics using Riemannian
Mat\'ern Kernels [64.62221198500467]
We show how to implement geometry-aware kernels for Bayesian optimization.
This technique can be used for control parameter tuning, parametric policy adaptation, and structure design in robotics.
arXiv Detail & Related papers (2021-11-02T09:47:22Z) - Sparse Quadratic Optimisation over the Stiefel Manifold with Application
to Permutation Synchronisation [71.27989298860481]
We address the non- optimisation problem of finding a matrix on the Stiefel manifold that maximises a quadratic objective function.
We propose a simple yet effective sparsity-promoting algorithm for finding the dominant eigenspace matrix.
arXiv Detail & Related papers (2021-09-30T19:17:35Z) - Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $mathcalO(fraclogNsqrtN)$ to $mathcal O(e-sqrtN)$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $mathcal O(N-sqrt
arXiv Detail & Related papers (2021-05-10T13:07:44Z) - High-Dimensional Gaussian Process Inference with Derivatives [90.8033626920884]
We show that in the low-data regime $ND$, the Gram matrix can be decomposed in a manner that reduces the cost of inference to $mathcalO(N2D + (N2)3)$.
We demonstrate this potential in a variety of tasks relevant for machine learning, such as optimization and Hamiltonian Monte Carlo with predictive gradients.
arXiv Detail & Related papers (2021-02-15T13:24:41Z) - Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing a $ntimes n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y)=-logdotpvarphi(x)varphi(y)$ where $varphi$ is a map from the ground space onto the positive orthant $RRr_+$, with $rll n$.
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.