Sample-Efficient Optimisation with Probabilistic Transformer Surrogates
- URL: http://arxiv.org/abs/2205.13902v2
- Date: Mon, 30 May 2022 08:55:32 GMT
- Title: Sample-Efficient Optimisation with Probabilistic Transformer Surrogates
- Authors: Alexandre Maraval, Matthieu Zimmer, Antoine Grosnit, Rasul Tutunov,
Jun Wang, Haitham Bou Ammar
- Abstract summary: This paper investigates the feasibility of employing state-of-the-art probabilistic transformers in Bayesian optimisation.
We observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation.
We introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser trading off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance.
- Score: 66.98962321504085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Faced with problems of increasing complexity, recent research in Bayesian
Optimisation (BO) has focused on adapting deep probabilistic models as flexible
alternatives to Gaussian Processes (GPs). In a similar vein, this paper
investigates the feasibility of employing state-of-the-art probabilistic
transformers in BO. Upon further investigation, we observe two drawbacks
stemming from their training procedure and loss definition, hindering their
direct deployment as proxies in black-box optimisation. First, we notice that
these models are trained on uniformly distributed inputs, which impairs
predictive accuracy on non-uniform data - a setting arising from any typical BO
loop due to exploration-exploitation trade-offs. Second, we realise that
training losses (e.g., cross-entropy) only asymptotically guarantee accurate
posterior approximations, i.e., after arriving at the global optimum, which
generally cannot be ensured. At the stationary points of the loss function,
however, we observe a degradation in predictive performance, especially in
exploratory regions of the input space. To tackle these shortcomings, we
introduce two components: 1) a BO-tailored training prior supporting
non-uniformly distributed points, and 2) a novel approximate posterior
regulariser trading off accuracy and input sensitivity to filter favourable
stationary points for improved predictive performance. In a large panel of
experiments, we demonstrate, for the first time, that one transformer
pre-trained on data sampled from random GP priors produces competitive results
on 16 benchmark black-boxes compared to GP-based BO. Since our model is only
pre-trained once and used in all tasks without any retraining or fine-tuning,
we report an order-of-magnitude time reduction, while matching and sometimes
outperforming GPs.
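
As a rough illustration of how such a pre-trained probabilistic surrogate slots into a standard BO loop without per-task retraining, the sketch below pairs a placeholder surrogate with an Expected Improvement acquisition. The `TransformerSurrogate` class, its toy distance-weighted predictor, and the one-dimensional set-up are assumptions made purely for illustration; they do not reproduce the paper's transformer architecture, BO-tailored training prior, or posterior regulariser.

```python
# Minimal sketch: BO loop with a pre-trained probabilistic surrogate used as a
# drop-in replacement for a GP. The surrogate interface is an assumption; the
# paper's actual model, training prior, and regulariser are not reproduced.
import numpy as np
from scipy.stats import norm


class TransformerSurrogate:
    """Placeholder for a pre-trained probabilistic surrogate.

    A real model would condition on (X_obs, y_obs) in a single forward pass and
    return an approximate posterior at the query points; here a toy
    distance-weighted predictor stands in so the sketch runs end to end.
    """

    def predict(self, X_obs, y_obs, X_query):
        d = np.abs(X_query[:, None] - X_obs[None, :])   # (n_query, n_obs)
        w = np.exp(-(d / 0.2) ** 2) + 1e-12
        mean = (w * y_obs[None, :]).sum(1) / w.sum(1)
        std = 1.0 / np.sqrt(w.sum(1))                   # grows away from the data
        return mean, std


def expected_improvement(mean, std, best_y):
    """EI for minimisation: E[max(best_y - f, 0)] under a Gaussian posterior."""
    z = (best_y - mean) / std
    return (best_y - mean) * norm.cdf(z) + std * norm.pdf(z)


def bo_loop(black_box, surrogate, n_init=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n_init)              # initial uniform design
    y = np.array([black_box(x) for x in X])
    grid = np.linspace(0.0, 1.0, 512)                   # candidate set

    for _ in range(n_iter):
        # One forward pass per iteration: no retraining or fine-tuning per task.
        mean, std = surrogate.predict(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mean, std, y.min()))]
        X = np.append(X, x_next)
        y = np.append(y, black_box(x_next))
        # X becomes increasingly non-uniform as exploitation concentrates
        # queries: the distribution mismatch the BO-tailored prior targets.
    return X[np.argmin(y)], y.min()


if __name__ == "__main__":
    x_best, y_best = bo_loop(lambda x: (x - 0.37) ** 2, TransformerSurrogate())
    print(f"best x = {x_best:.3f}, best y = {y_best:.4f}")
```

Because the surrogate only conditions on the observed data at prediction time, nothing inside the loop is retrained, which is the source of the time savings reported relative to GP-based BO.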
Related papers
- Robust Bayesian Optimization via Localized Online Conformal Prediction [37.549297668783254]
We introduce localized online conformal prediction-based Bayesian optimization (LOCBO).
LOCBO calibrates the GP model through localized online conformal prediction (CP).
We provide theoretical performance guarantees for LOCBO's iterates that hold for the unobserved objective function.
arXiv Detail & Related papers (2024-11-26T12:45:54Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Statistical Foundations of Prior-Data Fitted Networks [0.7614628596146599]
Prior-data fitted networks (PFNs) were recently proposed as a new paradigm for machine learning.
This article establishes a theoretical foundation for PFNs and illuminates the statistical mechanisms governing their behavior.
arXiv Detail & Related papers (2023-05-18T16:34:21Z) - Debiased Fine-Tuning for Vision-language Models by Prompt Regularization [50.41984119504716]
We present a new paradigm for fine-tuning large-scale vision pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
ProReg uses predictions obtained by prompting the pretrained model to regularize fine-tuning.
We show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning, and other state-of-the-art methods.
arXiv Detail & Related papers (2023-01-29T11:53:55Z) - Test-time Batch Normalization [61.292862024903584]
Deep neural networks often suffer the data distribution shift between training and testing.
We revisit the batch normalization (BN) in the training process and reveal two key insights benefiting test-time optimization.
We propose a novel test-time BN layer design, GpreBN, which is optimized during testing by minimizing an entropy loss (a generic sketch of this idea follows the list below).
arXiv Detail & Related papers (2022-05-20T14:33:39Z) - Local Gaussian process extrapolation for BART models with applications
to causal inference [0.7734726150561088]
This paper proposes a novel extrapolation strategy that grafts Gaussian processes to the leaf nodes in BART for predicting points outside the range of the observed data.
In simulation studies, the new approach boasts superior performance compared to popular alternatives such as Jackknife+.
arXiv Detail & Related papers (2022-04-23T00:37:53Z) - Reducing the Amortization Gap in Variational Autoencoders: A Bayesian
Random Function Approach [38.45568741734893]
Inference in our GP model is done by a single feed-forward pass through the network, significantly faster than semi-amortized methods.
We show that our approach attains higher test-data likelihood than state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-02-05T13:01:12Z) - Evaluating Prediction-Time Batch Normalization for Robustness under
Covariate Shift [81.74795324629712]
We study prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)