PAC-Bayes Meets Online Contextual Optimization
- URL: http://arxiv.org/abs/2511.20413v1
- Date: Tue, 25 Nov 2025 15:37:31 GMT
- Title: PAC-Bayes Meets Online Contextual Optimization
- Authors: Zhuojun Xie, Adam Abdin, Yiping Fang
- Abstract summary: This work introduces, to the best of our knowledge, the first Bayesian online contextual optimization framework. Grounded in PAC-Bayes theory and general Bayesian updating principles, our framework achieves $\mathcal{O}(\sqrt{T})$ regret for bounded and mixable losses via a Gibbs posterior.
- Score: 4.004966432215451
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The predict-then-optimize paradigm bridges online learning and contextual optimization in dynamic environments. Previous works have investigated the sequential updating of predictors using feedback from downstream decisions to minimize regret in the full-information setting. However, existing approaches are predominantly frequentist, rely heavily on gradient-based strategies, and employ deterministic predictors that could yield high variance in practice despite their asymptotic guarantees. This work introduces, to the best of our knowledge, the first Bayesian online contextual optimization framework. Grounded in PAC-Bayes theory and general Bayesian updating principles, our framework achieves $\mathcal{O}(\sqrt{T})$ regret for bounded and mixable losses via a Gibbs posterior, eliminates the dependence on gradients through sequential Monte Carlo samplers, and thereby accommodates nondifferentiable problems. Theoretical developments and numerical experiments substantiate our claims.
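As a rough illustration of the gradient-free mechanism the abstract describes, the sketch below maintains a particle approximation of a Gibbs posterior and reweights it each round by the exponentiated decision loss. The newsvendor-style decision problem, the temperature eta, and the jitter-based move step (a crude stand-in for the paper's sequential Monte Carlo samplers) are all assumptions for illustration, not the paper's actual construction.

```python
# Minimal sketch: particle-based Gibbs-posterior update for online
# predict-then-optimize. Everything below (loss, eta, dimensions) is
# an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def decision(theta, x):
    """Downstream decision: an order quantity from a linear demand predictor."""
    return max(theta @ x, 0.0)

def decision_loss(theta, x, y, over=1.0, under=2.0):
    """Nondifferentiable newsvendor loss of the induced decision."""
    q = decision(theta, x)
    return over * max(q - y, 0.0) + under * max(y - q, 0.0)

def gibbs_smc_step(particles, weights, x, y, eta=0.5, jitter=0.05):
    """One online round: reweight by exp(-eta * loss), then resample
    and jitter when the effective sample size degenerates."""
    losses = np.array([decision_loss(th, x, y) for th in particles])
    weights = weights * np.exp(-eta * losses)
    weights = weights / weights.sum()
    if 1.0 / np.sum(weights**2) < len(particles) / 2:  # low ESS
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx] + jitter * rng.standard_normal(particles.shape)
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

d, n = 3, 256
particles = rng.standard_normal((n, d))    # draws from a standard normal prior
weights = np.full(n, 1.0 / n)
cum_loss = 0.0
for t in range(100):
    x = np.abs(rng.standard_normal(d))
    y = max(np.array([1.0, 0.5, 0.2]) @ x + 0.1 * rng.standard_normal(), 0.0)
    theta_bar = weights @ particles        # posterior-mean predictor
    cum_loss += decision_loss(theta_bar, x, y)
    particles, weights = gibbs_smc_step(particles, weights, x, y)
print(f"average decision loss: {cum_loss / 100:.3f}")
```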
Related papers
- Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts.
We analyze how the sampling policy's coverage evolves throughout on-policy training.
We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z)
- Sequential Bayesian Optimal Experimental Design in Infinite Dimensions via Policy Gradient Reinforcement Learning [3.2580743227673694]
High-fidelity approaches require repeated forward and adjoint PDE solves inside nested Bayesian inversion and design loops.
We formulate SBOED as a finite-horizon Markov decision process and learn an amortized design policy via policy-gradient reinforcement learning.
Numerical experiments on sequential multi-sensor placement for contaminant source tracking demonstrate an approximately $100\times$ speedup over high-fidelity finite element methods.
arXiv Detail & Related papers (2026-01-09T15:44:49Z)
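The amortized-design idea can be illustrated with a bare-bones policy-gradient loop. The Gaussian policy, the surrogate information-gain reward, and all dimensions below are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch: REINFORCE on an episodic design policy. The reward
# is a toy stand-in for an expected-information-gain estimate.
import numpy as np

rng = np.random.default_rng(0)
W = np.zeros((2, 4))  # linear policy: belief features -> design mean

def surrogate_reward(design, belief):
    """Assumed stand-in for an information-gain estimate."""
    return -np.sum((design - belief[:2]) ** 2)

lr, sigma = 0.05, 0.3
for episode in range(200):
    belief = rng.standard_normal(4)      # summary of the current posterior
    grads, rewards = [], []
    for step in range(5):                # finite-horizon design episode
        mean = W @ belief
        design = mean + sigma * rng.standard_normal(2)
        rewards.append(surrogate_reward(design, belief))
        # Score function of the Gaussian policy with respect to W.
        grads.append(np.outer((design - mean) / sigma**2, belief))
        belief = 0.9 * belief            # toy posterior contraction
    ret = np.sum(rewards)
    for g in grads:
        W += lr * ret * g                # REINFORCE update
```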
- In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior [53.21550098214227]
In-context reinforcement learning promises fast adaptation to unseen environments without parameter updates.
We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via a deep ensemble and updates this prior at test time.
We prove that SPICE achieves regret-optimal behaviour in both bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories.
arXiv Detail & Related papers (2026-01-06T13:41:31Z)
- On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference [71.09125259964684]
Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs).
We study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context.
On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms.
arXiv Detail & Related papers (2025-12-04T08:21:33Z)
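A minimal sketch of the filtering procedure as summarized above; `generate` and `reward` are hypothetical stand-ins for an LLM sampler and a reward model.

```python
# Minimal sketch: sequential inference that appends a generation to the
# context only when its reward clears a threshold.
from typing import Callable, List

def filtered_inference(prompt: str,
                       generate: Callable[[str], str],
                       reward: Callable[[str], float],
                       budget: int = 8,
                       threshold: float = 0.5) -> List[str]:
    context, kept = prompt, []
    for _ in range(budget):
        draft = generate(context)
        if reward(draft) >= threshold:   # keep only high-reward drafts
            kept.append(draft)
            context = context + "\n" + draft
    return kept
```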
- Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients [16.39606116102731]
The stochastic Polyak step size has proven to be a promising adaptive choice for stochastic gradient descent (SGD).
In this work, we provide rigorous convergence guarantees for non-smooth optimization without the need for strong assumptions.
Comprehensive experiments on deep networks corroborate the theory.
arXiv Detail & Related papers (2025-12-02T02:24:32Z)
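For context, a minimal sketch of a capped stochastic Polyak step size on a non-smooth problem; the cap and denominator floor are generic safeguards assumed for illustration, not necessarily the paper's rule.

```python
# Minimal sketch: SGD with a capped stochastic Polyak step size,
# gamma_t = min(gamma_max, (f_i(x) - f_i*) / ||g||^2), on a non-smooth
# least-absolute-deviations problem where each f_i* = 0.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])

x = np.zeros(5)
gamma_max, eps = 1.0, 1e-12
for t in range(500):
    i = rng.integers(len(b))
    residual = A[i] @ x - b[i]
    f_i = abs(residual)                  # non-smooth piece, f_i* = 0
    g = np.sign(residual) * A[i]         # a subgradient
    step = min(gamma_max, f_i / max(g @ g, eps))
    x -= step * g
```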
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [69.1820058966619]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
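The combined objective described above can be sketched as a DPO-style preference loss plus an SFT term. The PyTorch snippet below is an assumed instantiation, with illustrative hyperparameters beta and lam, not the paper's exact algorithm.

```python
# Minimal sketch: preference-optimization loss regularized by an SFT
# term on the chosen responses.
import torch
import torch.nn.functional as F

def regularized_preference_loss(logp_chosen, logp_rejected,
                                ref_logp_chosen, ref_logp_rejected,
                                beta=0.1, lam=1.0):
    # DPO-style preference term on (chosen, rejected) sequence log-probs.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    pref = -F.logsigmoid(margin).mean()
    # SFT term: keep probability mass on the chosen responses.
    sft = -logp_chosen.mean()
    return pref + lam * sft
```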
- Discounted Adaptive Online Learning: Towards Better Regularization [5.5899168074961265]
We study online learning in adversarial nonstationary environments.
We propose an adaptive (i.e., instance-optimal) algorithm that improves on the widespread non-adaptive baseline.
We also consider the (Gibbs and Candes, 2021)-style online conformal prediction problem.
arXiv Detail & Related papers (2024-02-05T04:29:39Z)
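The (Gibbs and Candes, 2021)-style update mentioned above adjusts the miscoverage level from binary coverage feedback. A minimal sketch, with illustrative scores and learning rate:

```python
# Minimal sketch: adaptive conformal inference, updating the target
# miscoverage level alpha_t online from coverage errors.
import numpy as np

def adaptive_conformal(scores_cal, stream, alpha=0.1, gamma=0.01):
    """scores_cal: calibration nonconformity scores.
    stream: iterable of nonconformity scores for new points."""
    alpha_t = alpha
    coverage = []
    for s_t in stream:
        # Quantile threshold at the current adapted level.
        q = np.quantile(scores_cal, min(max(1 - alpha_t, 0.0), 1.0))
        err = float(s_t > q)              # 1 if the set failed to cover
        coverage.append(1 - err)
        alpha_t += gamma * (alpha - err)  # Gibbs-Candes-style update
    return np.mean(coverage)
```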
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Improving Generalization of Complex Models under Unbounded Loss Using PAC-Bayes Bounds [10.94126149188336]
PAC-Bayes learning theory has focused extensively on establishing tight upper bounds for test errors.
A recently proposed training procedure, called PAC-Bayes training, updates the model toward minimizing these bounds.
While this approach is theoretically sound, in practice it has not achieved a test error as low as that obtained by empirical risk minimization (ERM).
We introduce a new PAC-Bayes training algorithm with improved performance and reduced reliance on prior tuning.
arXiv Detail & Related papers (2023-05-30T17:31:25Z)
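PAC-Bayes training in the sense above means descending on a bound of the form empirical risk plus a complexity penalty rather than the raw risk. The sketch below uses a McAllester-style bound with a diagonal Gaussian posterior, which is one standard instantiation assumed for illustration, not the algorithm of this particular paper.

```python
# Minimal sketch: train a stochastic linear classifier by minimizing
# empirical risk + sqrt((KL + log(2*sqrt(n)/delta)) / (2n)).
import math
import torch

n, d, delta = 200, 10, 0.05
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()

mu = torch.zeros(d, requires_grad=True)            # posterior mean
rho = torch.full((d,), -2.0, requires_grad=True)   # pre-softplus posterior scale
opt = torch.optim.Adam([mu, rho], lr=1e-2)
sigma0 = 1.0                                       # prior N(0, sigma0^2 I)

for step in range(500):
    sigma = torch.nn.functional.softplus(rho)
    w = mu + sigma * torch.randn(d)                # reparameterized posterior sample
    risk = torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)
    kl = 0.5 * torch.sum(sigma**2 / sigma0**2 + mu**2 / sigma0**2
                         - 1.0 - 2.0 * torch.log(sigma / sigma0))
    bound = risk + torch.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))
    opt.zero_grad()
    bound.backward()
    opt.step()
```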
- AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization [14.531550983885772]
We propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all associated statistics.
This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time.
arXiv Detail & Related papers (2022-01-18T03:13:19Z)
- Beyond variance reduction: Understanding the true impact of baselines on policy optimization [24.09670734037029]
We show that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates.
We present theoretical results showing that, at least for bandit problems, variance reduction alone is not sufficient to explain the impact of baselines on the learning dynamics.
arXiv Detail & Related papers (2020-08-31T17:52:09Z)
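A minimal sketch of the object the entry above studies: a REINFORCE estimator with a baseline on a two-armed bandit, where the baseline shifts the update without biasing it. All values are illustrative.

```python
# Minimal sketch: score-function policy gradient with a baseline b,
# grad = (r - b) * d log pi(a) / d theta, on a two-armed bandit.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                      # logit preference for arm 1
means = np.array([0.0, 1.0])     # true arm means
lr, baseline = 0.1, 0.5

for t in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    a = rng.random() < p1
    r = means[int(a)] + rng.standard_normal()
    grad = (r - baseline) * ((1.0 if a else 0.0) - p1)
    theta += lr * grad
```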
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.