Conformal Policy Control
- URL: http://arxiv.org/abs/2603.02196v1
- Date: Mon, 02 Mar 2026 18:54:36 GMT
- Title: Conformal Policy Control
- Authors: Drew Prinster, Clara Fannjiang, Ji Won Park, Kyunghyun Cho, Anqi Liu, Suchi Saria, Samuel Stanton,
- Abstract summary: We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Unlike conservative optimization methods, we do not assume the user has identified the correct model class. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is possible from the first moment of deployment.
- Score: 50.46542384484142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class or tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
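To make the calibration step concrete: the split-conformal recipe the abstract refers to can be sketched as computing a finite-sample quantile of constraint scores logged under the safe policy and using it to gate the optimized policy. The sketch below is illustrative only; the function names, `alpha`, and the Gaussian toy data are assumptions, and the paper's actual method additionally handles non-monotonic bounded constraint functions.

```python
import numpy as np

def conformal_threshold(safe_scores, alpha):
    """Finite-sample conformal quantile of constraint scores observed
    under the safe reference policy. With probability >= 1 - alpha, a
    fresh exchangeable score falls at or below this threshold."""
    n = len(safe_scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too little calibration data to certify anything
    return np.sort(safe_scores)[k - 1]

def gate_action(candidate_score, threshold):
    """Admit the optimized policy's action only if its (estimated)
    constraint score stays within the calibrated budget."""
    return candidate_score <= threshold

# Usage: 500 constraint evaluations logged under the safe policy,
# declared risk tolerance alpha = 0.1.
rng = np.random.default_rng(0)
safe_scores = rng.normal(loc=0.0, scale=1.0, size=500)
tau = conformal_threshold(safe_scores, alpha=0.1)
print(tau, gate_action(candidate_score=0.3, threshold=tau))
```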
Related papers
- Safe Exploration via Policy Priors [45.58021831092113]
We show that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret.<n>Experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable, outperforms the state-of-the-art and validate our theoretical guarantees in practice.
arXiv Detail & Related papers (2026-01-27T13:45:28Z)
- Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank [64.44255178199846]
We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior.
PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model.
Our experiments show that PRPO provides higher performance than the existing safe inverse propensity scoring approach.
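The "removes incentives" idea can be illustrated with a PPO-style clipped surrogate; this is only a guess at the flavor of the mechanism, not the actual PRPO objective, and all names below are hypothetical.

```python
import numpy as np

def clipped_ranking_objective(new_propensities, safe_propensities,
                              utilities, epsilon=0.2):
    """PPO-style clipped surrogate (illustrative only): once the
    propensity ratio leaves [1 - epsilon, 1 + epsilon], its gradient
    vanishes, so there is no incentive to drift further from the safe
    ranking model."""
    ratio = np.asarray(new_propensities) / np.asarray(safe_propensities)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    utilities = np.asarray(utilities)
    # Pessimistic minimum of the clipped and unclipped surrogates.
    return np.minimum(ratio * utilities, clipped * utilities).mean()
```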
arXiv Detail & Related papers (2024-09-15T22:22:27Z)
- CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies [30.57323631122579]
We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising.
Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements.
We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level.
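A minimal sketch of how such an adoption gate might look, assuming a one-sided t-test with a Bonferroni correction across candidate thresholds; the paper's multiple-testing procedure is more refined than this.

```python
import numpy as np
from scipy import stats

def safe_to_adopt(improvements, delta=0.05, n_candidates=1):
    """Illustrative adoption gate: switch to a candidate threshold
    policy only when a one-sided lower confidence bound on its mean
    improvement over the baseline is positive. Bonferroni-correcting
    delta across n_candidates caps the chance of adopting a worse
    policy at the pre-specified error level, under the test's
    assumptions."""
    improvements = np.asarray(improvements, dtype=float)
    n = improvements.size
    mean = improvements.mean()
    se = improvements.std(ddof=1) / np.sqrt(n)
    crit = stats.t.ppf(1.0 - delta / n_candidates, df=n - 1)
    return mean - crit * se > 0.0
```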
arXiv Detail & Related papers (2024-08-21T21:38:03Z)
- Practical and Robust Safety Guarantees for Advanced Counterfactual Learning to Rank [64.44255178199846]
We generalize the existing safe CLTR approach to make it applicable to state-of-the-art doubly robust CLTR.
We also propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior.
PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.
arXiv Detail & Related papers (2024-07-29T12:23:59Z)
- Information-Theoretic Safe Bayesian Optimization [59.758009422067005]
We consider a sequential decision making task, where the goal is to optimize an unknown function without evaluating parameters that violate an unknown (safety) constraint.
Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case.
We propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate.
arXiv Detail & Related papers (2024-02-23T14:31:10Z)
- Information-Theoretic Safe Exploration with Gaussian Processes [89.31922008981735]
We consider a sequential decision making task where we are not allowed to evaluate parameters that violate an unknown (safety) constraint.
Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case.
We propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate.
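As a rough stand-in for the criterion described in these two papers, one can restrict queries to points the GP posterior certifies as safe and pick the most uncertain among them. The real methods maximize an information-gain quantity rather than raw posterior variance, so the sketch below is a simplification with assumed names.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def next_safe_query(gp_constraint: GaussianProcessRegressor,
                    candidates: np.ndarray,
                    safety_limit: float, beta: float = 2.0):
    """Crude stand-in for an information-theoretic criterion: keep
    only points whose pessimistic (upper-confidence) constraint value
    stays below the safety limit, then query the most uncertain one."""
    mean, std = gp_constraint.predict(candidates, return_std=True)
    safe = mean + beta * std <= safety_limit  # pessimistic safety check
    if not safe.any():
        return None  # nothing certifiably safe to evaluate
    idx = np.where(safe)[0]
    return candidates[idx[np.argmax(std[idx])]]
```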
arXiv Detail & Related papers (2022-12-09T15:23:58Z)
- Conformal Off-Policy Prediction in Contextual Bandits [54.67508891852636]
Conformal off-policy prediction can output reliable predictive intervals for the outcome under a new target policy.
We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup.
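The core calibration step behind such intervals can be sketched as a weighted conformal quantile, with calibration residuals reweighted by importance ratios between the target and behavior policies. The helper below is an illustrative reconstruction, not the paper's exact estimator.

```python
import numpy as np

def weighted_conformal_bound(scores, weights, alpha=0.1):
    """Weighted split-conformal quantile: nonconformity scores from
    the logged data are reweighted by importance ratios
    w_i = pi_target(a_i | x_i) / pi_behavior(a_i | x_i), with a point
    mass at +inf standing in for the unseen test point. The returned
    bound defines the interval {y : score(x_test, y) <= bound}."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    # Normalize including the test point's (infinite-score) mass.
    total = weights.sum() + 1.0
    cum = np.cumsum(weights) / total
    idx = np.searchsorted(cum, 1.0 - alpha)
    return np.inf if idx >= len(scores) else scores[idx]
```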
arXiv Detail & Related papers (2022-06-09T10:39:33Z)
- Learn Zero-Constraint-Violation Policy in Model-Free Constrained Reinforcement Learning [7.138691584246846]
We propose the safe set actor-critic (SSAC) algorithm, which confines the policy update using safety-oriented energy functions.
The safety index is designed to increase rapidly for potentially dangerous actions.
We claim that we can learn the energy function in a model-free manner similar to learning a value function.
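A toy version of such an energy-based safety layer might gate candidate actions on a learned safety index and fall back to a known-safe action. Here `energy_fn`, the threshold, and the fallback are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def safety_filtered_action(candidates, energy_fn, fallback, threshold=0.0):
    """Toy energy-based safety layer: candidates are assumed ordered
    from most to least preferred by the task policy; return the first
    one whose learned safety energy stays at or below the threshold,
    else fall back to a known-safe action."""
    for action in candidates:
        if energy_fn(action) <= threshold:
            return action
    return fallback

# Usage with a hypothetical quadratic energy: actions far from the
# origin are deemed unsafe.
best = safety_filtered_action(
    candidates=[np.array([2.0, 0.0]), np.array([0.5, 0.5])],
    energy_fn=lambda a: float(a @ a) - 1.0,
    fallback=np.zeros(2),
)
print(best)  # -> [0.5 0.5]
```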
arXiv Detail & Related papers (2021-11-25T07:24:30Z)