Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making
- URL: http://arxiv.org/abs/2512.22313v1
- Date: Fri, 26 Dec 2025 10:44:40 GMT
- Title: Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making
- Authors: Wenzhang Du,
- Abstract summary: We study online learning of a one-dimensional threshold policy from bandit feedback.<n>We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Closed-loop decision-making systems (e.g., lending, screening, or recidivism risk assessment) often operate under fairness and service constraints while inducing feedback effects: decisions change who appears in the future, yielding non-stationary data and potentially amplifying disparities. We study online learning of a one-dimensional threshold policy from bandit feedback under demographic parity (DP) and, optionally, service-rate constraints. The learner observes only a scalar score each round and selects a threshold; reward and constraint residuals are revealed only for the chosen threshold. We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals for each candidate threshold. At each round, OFS selects a threshold that appears feasible under confidence bounds and, among those, maximizes optimistic reward; if no threshold appears feasible, OFS selects the threshold minimizing optimistic constraint violation. This design directly targets feasible high-utility thresholds and is particularly effective for low-dimensional, interpretable policy classes where discretization is natural. We evaluate OFS on (i) a synthetic closed-loop benchmark with stable contraction dynamics and (ii) two semi-synthetic closed-loop benchmarks grounded in German Credit and COMPAS, constructed by training a score model and feeding group-dependent acceptance decisions back into population composition. Across all environments, OFS achieves higher reward with smaller cumulative constraint violation than unconstrained and primal-dual bandit baselines, and is near-oracle relative to the best feasible fixed threshold under the same sweep procedure. Experiments are reproducible and organized with double-blind-friendly relative outputs.
Related papers
- Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards [39.489554597919145]
Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion.<n>For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit.<n>We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block.
arXiv Detail & Related papers (2026-02-10T19:22:37Z) - SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization [0.0]
We introduce SLIME, a reference-free alignment objective designed to decouple preference learning from generation quality.<n>Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-02T17:46:06Z) - Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination [8.826173150779145]
Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set.<n>We show how IP credal sets translate into DRO objectives with interpretable tolerance levels.
arXiv Detail & Related papers (2026-01-29T06:37:36Z) - LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems [95.35293543918762]
Large language models (LLMs) often generate unreliable answers, while uncertainty methods fail to fully distinguish correct from incorrect predictions.<n>We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level.<n>We propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint.
arXiv Detail & Related papers (2025-12-01T11:27:09Z) - Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models [2.231476498067998]
dissertation investigates how reinforcement learning methods can be designed to be safe, sample-efficient, and robust.<n> Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains - ranking and recommendation, and text-to-image diffusion models.
arXiv Detail & Related papers (2025-10-17T08:37:38Z) - Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty [49.19257648205146]
We propose an unsupervised conformal inference framework for generation.<n>Our gates achieve close-to-nominal coverage and provide tighter, more stable thresholds than split UCP.<n>The result is a label-free, API-compatible gate for test-time filtering.
arXiv Detail & Related papers (2025-09-26T23:40:47Z) - Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time [52.230936493691985]
We propose SITAlign, an inference framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria.<n>We provide theoretical insights by deriving sub-optimality bounds of our satisficing based inference alignment approach.
arXiv Detail & Related papers (2025-05-29T17:56:05Z) - Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds [28.4976864705409]
This paper studies constrained Markov decision processes (CMDPs) with constraints against thresholds, aiming at the safety of reinforcement learning in unknown and uncertain environments.<n>We leverage a GrowingWindow estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Pessimistic-Optimistic Thresholding (SPOT)<n>SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings.
arXiv Detail & Related papers (2025-04-07T11:58:19Z) - Constrained Linear Thompson Sampling [39.724313550777715]
Constrained Linear Thompson Sampling (COLTS) is a sampling-based framework that selects actions by solving perturbed linear programs.<n>We develop two main variants: S-COLTS, which ensures zero risk and $widetildeO(sqrtd3 T)$ regret given a safe action, and R-COLTS, which achieves $widetildeO(sqrtd3 T)$ regret and risk with no instance information.
arXiv Detail & Related papers (2025-03-03T20:44:58Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - C-Learner: Constrained Learning for Causal Inference [4.370964009390564]
We propose a novel debiasing approach that achieves the best weighting of both worlds, producing stable plug-in estimates.<n>Our constrained learning framework solves for the best plug-in estimator under the constraint that the first-order error with respect to the plugged-in quantity is zero.
arXiv Detail & Related papers (2024-05-15T16:38:28Z) - Equal Opportunity of Coverage in Fair Regression [50.76908018786335]
We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making.
We propose Equal Opportunity of Coverage (EOC) that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level.
arXiv Detail & Related papers (2023-11-03T21:19:59Z) - On Lower Bounds for Standard and Robust Gaussian Process Bandit
Optimization [55.937424268654645]
We consider algorithm-independent lower bounds for the problem of black-box optimization of functions having a bounded norm.
We provide a novel proof technique for deriving lower bounds on the regret, with benefits including simplicity, versatility, and an improved dependence on the error probability.
arXiv Detail & Related papers (2020-08-20T03:48:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.