Related papers: Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making

Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making

URL: http://arxiv.org/abs/2512.22313v1
Date: Fri, 26 Dec 2025 10:44:40 GMT
Title: Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making
Authors: Wenzhang Du,
Abstract summary: We study online learning of a one-dimensional threshold policy from bandit feedback.<n>We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Closed-loop decision-making systems (e.g., lending, screening, or recidivism risk assessment) often operate under fairness and service constraints while inducing feedback effects: decisions change who appears in the future, yielding non-stationary data and potentially amplifying disparities. We study online learning of a one-dimensional threshold policy from bandit feedback under demographic parity (DP) and, optionally, service-rate constraints. The learner observes only a scalar score each round and selects a threshold; reward and constraint residuals are revealed only for the chosen threshold. We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals for each candidate threshold. At each round, OFS selects a threshold that appears feasible under confidence bounds and, among those, maximizes optimistic reward; if no threshold appears feasible, OFS selects the threshold minimizing optimistic constraint violation. This design directly targets feasible high-utility thresholds and is particularly effective for low-dimensional, interpretable policy classes where discretization is natural. We evaluate OFS on (i) a synthetic closed-loop benchmark with stable contraction dynamics and (ii) two semi-synthetic closed-loop benchmarks grounded in German Credit and COMPAS, constructed by training a score model and feeding group-dependent acceptance decisions back into population composition. Across all environments, OFS achieves higher reward with smaller cumulative constraint violation than unconstrained and primal-dual bandit baselines, and is near-oracle relative to the best feasible fixed threshold under the same sweep procedure. Experiments are reproducible and organized with double-blind-friendly relative outputs.

Related papers

Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards [39.489554597919145]
Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion.<n>For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit.<n>We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block.
arXiv Detail & Related papers (2026-02-10T19:22:37Z)
SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization [0.0]
We introduce SLIME, a reference-free alignment objective designed to decouple preference learning from generation quality.<n>Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-02T17:46:06Z)
Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination [8.826173150779145]
Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set.<n>We show how IP credal sets translate into DRO objectives with interpretable tolerance levels.
arXiv Detail & Related papers (2026-01-29T06:37:36Z)
LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems [95.35293543918762]
Large language models (LLMs) often generate unreliable answers, while uncertainty methods fail to fully distinguish correct from incorrect predictions.<n>We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level.<n>We propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint.
arXiv Detail & Related papers (2025-12-01T11:27:09Z)
Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models [2.231476498067998]
dissertation investigates how reinforcement learning methods can be designed to be safe, sample-efficient, and robust.<n> Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains - ranking and recommendation, and text-to-image diffusion models.
arXiv Detail & Related papers (2025-10-17T08:37:38Z)
Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty [49.19257648205146]
We propose an unsupervised conformal inference framework for generation.<n>Our gates achieve close-to-nominal coverage and provide tighter, more stable thresholds than split UCP.<n>The result is a label-free, API-compatible gate for test-time filtering.
arXiv Detail & Related papers (2025-09-26T23:40:47Z)
Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time [52.230936493691985]
We propose SITAlign, an inference framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria.<n>We provide theoretical insights by deriving sub-optimality bounds of our satisficing based inference alignment approach.
arXiv Detail & Related papers (2025-05-29T17:56:05Z)
Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds [28.4976864705409]
This paper studies constrained Markov decision processes (CMDPs) with constraints against thresholds, aiming at the safety of reinforcement learning in unknown and uncertain environments.<n>We leverage a GrowingWindow estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Pessimistic-Optimistic Thresholding (SPOT)<n>SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings.
arXiv Detail & Related papers (2025-04-07T11:58:19Z)
Constrained Linear Thompson Sampling [39.724313550777715]
Constrained Linear Thompson Sampling (COLTS) is a sampling-based framework that selects actions by solving perturbed linear programs.<n>We develop two main variants: S-COLTS, which ensures zero risk and $widetildeO(sqrtd3 T)$ regret given a safe action, and R-COLTS, which achieves $widetildeO(sqrtd3 T)$ regret and risk with no instance information.
arXiv Detail & Related papers (2025-03-03T20:44:58Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
C-Learner: Constrained Learning for Causal Inference [4.370964009390564]
We propose a novel debiasing approach that achieves the best weighting of both worlds, producing stable plug-in estimates.<n>Our constrained learning framework solves for the best plug-in estimator under the constraint that the first-order error with respect to the plugged-in quantity is zero.
arXiv Detail & Related papers (2024-05-15T16:38:28Z)
Equal Opportunity of Coverage in Fair Regression [50.76908018786335]
We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making. We propose Equal Opportunity of Coverage (EOC) that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level.
arXiv Detail & Related papers (2023-11-03T21:19:59Z)
On Lower Bounds for Standard and Robust Gaussian Process Bandit Optimization [55.937424268654645]
We consider algorithm-independent lower bounds for the problem of black-box optimization of functions having a bounded norm. We provide a novel proof technique for deriving lower bounds on the regret, with benefits including simplicity, versatility, and an improved dependence on the error probability.
arXiv Detail & Related papers (2020-08-20T03:48:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.