Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization
- URL: http://arxiv.org/abs/2509.08194v1
- Date: Tue, 09 Sep 2025 23:56:16 GMT
- Title: Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization
- Authors: Caio de Prospero Iglesias, Kimberly Villalobos Carballo, Dimitris Bertsimas
- Abstract summary: We propose a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.
- Score: 4.154714580436713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies--arising from different modeling paradigms--exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems--single-stage newsvendor and two-stage shipment planning--PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.
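To make the pipeline concrete, here is a minimal, hypothetical sketch of Prescribe-then-Select on a single-stage newsvendor problem. Everything in it is illustrative rather than the authors' implementation: the synthetic demand model, the two candidate policies (a covariate-free sample-average-approximation policy and a simple nearest-neighbor contextual policy), and scikit-learn's DecisionTreeClassifier standing in for the paper's ensembles of Optimal Policy Trees.

```python
# Illustrative sketch of Prescribe-then-Select (PS) on a newsvendor problem.
# Assumptions (not from the paper): synthetic data, hand-rolled candidate
# policies, and a plain decision tree instead of Optimal Policy Tree ensembles.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
price, cost = 5.0, 3.0                    # sell price and unit cost
crit = (price - cost) / price             # critical fractile for the newsvendor

def profit(q, d):
    """Realized newsvendor profit for order quantity q and demand d."""
    return price * min(q, d) - cost * q

# Synthetic training data: demand depends on the covariate only for x >= 0.5,
# so neither candidate policy dominates over the whole covariate space.
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, 1))
D = np.where(X[:, 0] < 0.5,
             rng.poisson(20.0, n),                 # covariate-free regime
             rng.poisson(10.0 + 30.0 * X[:, 0]))   # covariate-driven regime

# Candidate policy 1: sample-average approximation, ignores covariates.
def policy_saa(x, D_train):
    return np.quantile(D_train, crit)

# Candidate policy 2: contextual policy, quantile over the k nearest covariates.
def policy_knn(x, X_train, D_train, k=50):
    idx = np.argsort(np.abs(X_train[:, 0] - x[0]))[:k]
    return np.quantile(D_train[idx], crit)

# Step 1 of PS: label each sample with the policy that wins out-of-fold,
# so the meta-policy is trained on honest (cross-validated) comparisons.
labels = np.empty(n, dtype=int)
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for i in te:
        q_saa = policy_saa(X[i], D[tr])
        q_knn = policy_knn(X[i], X[tr], D[tr])
        labels[i] = int(profit(q_knn, D[i]) > profit(q_saa, D[i]))

# Step 2 of PS: fit the meta-policy that maps covariates to a policy choice.
meta_policy = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

# At decision time, the meta-policy selects a candidate, which then prescribes.
x_new = np.array([0.8])
use_knn = meta_policy.predict(x_new.reshape(1, -1))[0] == 1
q = policy_knn(x_new, X, D) if use_knn else policy_saa(x_new, D)
print(f"selected policy: {'contextual' if use_knn else 'SAA'}, order: {q:.1f}")
```

The cross-validated labeling is the load-bearing choice in this sketch: comparing candidate policies only on out-of-fold profits keeps the meta-policy from memorizing in-sample winners, which matches the abstract's claim that policy choice is entirely data-driven.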
Related papers
- Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model [43.74350307533018]
We study policy alignment to preferences under unknown and unrestricted complexity. We use first-order optimization suited to neural networks and batched data.
arXiv Detail & Related papers (2025-12-26T08:22:41Z)
- Best-Effort Policies for Robust Markov Decision Processes [69.60742680559788]
We study the common generalization of Markov decision processes (MDPs) with sets of transition probabilities, known as robust MDPs (RMDPs). Among optimal robust policies, we single out those that are additionally best-effort; we call such a policy an optimal robust best-effort (ORBE) policy. We prove that ORBE policies always exist, characterize their structure, and present an algorithm to compute them with a small overhead compared to standard robust value iteration.
arXiv Detail & Related papers (2025-08-11T09:18:34Z)
- EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z)
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z)
- Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data [3.6714630660726586]
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments that maximize the expected total reward by leveraging pre-collected data. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes. We propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes.
arXiv Detail & Related papers (2025-05-14T15:44:10Z)
- Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop a theory of PMD for general policy classes, where we assume only a weaker variational gradient dominance condition and obtain convergence to the best-in-class policy. Our analysis leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z)
- Fat-to-Thin Policy Optimization: Offline RL with Sparse Policies [5.5938591697033555]
Sparse continuous policies are distributions that choose some actions at random yet keep strictly zero probability for the other actions. In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO). We instantiate FtTPO with the general $q$-Gaussian family that encompasses both heavy-tailed and sparse policies.
arXiv Detail & Related papers (2025-01-24T10:11:48Z)
- Constraint-Generation Policy Optimization (CGPO): Nonlinear Programming for Policy Optimization in Mixed Discrete-Continuous MDPs [21.246169498568342]
CGPO provides bounded policy error guarantees over an infinite range of initial states for many DC-MDPs with expressive nonlinear dynamics. CGPO can generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions. We experimentally demonstrate the applicability of CGPO across various domains, including inventory control, management of a water reservoir system, and physics control.
arXiv Detail & Related papers (2024-01-20T07:12:57Z)
- Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
arXiv Detail & Related papers (2022-06-22T19:00:08Z)
- CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
We consider and study the distribution of optimal policies. In experimental simulations, we show that CAMEO indeed obtains policies that all solve classic control problems. We further show that the different policies we sample exhibit different risk profiles, corresponding to interesting practical applications in interpretability.
arXiv Detail & Related papers (2022-05-19T09:48:56Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
arXiv Detail & Related papers (2020-06-15T09:16:09Z)