Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time
- URL: http://arxiv.org/abs/2505.23729v2
- Date: Sat, 31 May 2025 23:47:06 GMT
- Title: Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time
- Authors: Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi
- Abstract summary: We propose SITAlign, an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach.
- Score: 52.230936493691985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision-making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
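To make the satisficing recipe concrete, here is a minimal, hypothetical sketch of threshold-filtered candidate selection, not the authors' SITAlign implementation: among sampled responses, keep those whose secondary reward (e.g., harmlessness) clears a threshold, then return the one with the highest primary reward (e.g., helpfulness). The function names, reward scorers, and fallback rule are illustrative assumptions.

```python
# Hypothetical sketch of satisficing selection at inference time; reward
# scorers, names, and the fallback rule are illustrative assumptions, not
# the paper's SITAlign algorithm.
from typing import Callable, List, Optional

def satisficing_select(
    candidates: List[str],
    primary_reward: Callable[[str], float],    # e.g., a helpfulness scorer
    secondary_reward: Callable[[str], float],  # e.g., a harmlessness scorer
    threshold: float,                          # acceptable level for the secondary criterion
) -> Optional[str]:
    """Return the candidate with the highest primary reward among those
    whose secondary reward meets the satisficing threshold."""
    if not candidates:
        return None
    # Satisfice: keep only candidates that clear the secondary threshold.
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if not feasible:
        # Fallback when nothing is feasible: pick the candidate closest to
        # satisfying the secondary constraint.
        return max(candidates, key=secondary_reward)
    # Maximize the primary objective over the feasible set.
    return max(feasible, key=primary_reward)
```

A full inference-time method would apply this maximize-subject-to-threshold structure per decoding step or over sampled continuations using learned reward models for each criterion; the sketch above only illustrates the underlying selection rule.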
Related papers
- Risk-Averse Best Arm Set Identification with Fixed Budget and Fixed Confidence [0.562479170374811]
We introduce a novel problem setting in bandit optimization that addresses maximizing expected reward and minimizing associated uncertainty. We propose a unified meta-algorithmic framework capable of operating under both fixed-confidence and fixed-budget regimes. Our approach outperforms existing methods in terms of both accuracy and sample efficiency.
arXiv Detail & Related papers (2025-06-27T14:21:03Z) - Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks [7.701333337093469]
We consider a robust satisficing objective, where the goal is to consistently achieve a predefined performance threshold $\tau$, even under adversarial conditions. We propose two novel algorithms based on distinct formulations of robust satisficing, and show that they are instances of a general robust satisficing framework. Specifically, we derive two regret bounds: one that is sublinear over time, assuming certain conditions on the adversary and the satisficing threshold $\tau$, and another that scales with the perturbation magnitude but requires no assumptions on the adversary.
arXiv Detail & Related papers (2025-06-02T13:04:18Z) - Probability-Consistent Preference Optimization for Enhanced LLM Reasoning [36.74546743563837]
We propose a novel framework that establishes dual quantitative metrics for preference selection. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
arXiv Detail & Related papers (2025-05-29T15:20:44Z) - Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning [0.0]
We introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors.
arXiv Detail & Related papers (2025-05-17T06:09:13Z) - Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning [16.10753846850319]
Continual learning allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. We introduce a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations.
arXiv Detail & Related papers (2025-01-21T13:33:45Z) - Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning [17.245293915129942]
The optimal objective is a fundamental aspect of reinforcement learning (RL). While total return is ideal, discounted return is a practical objective due to its stability. We propose two alternative approaches to align the objectives.
arXiv Detail & Related papers (2024-07-18T08:33:10Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Bi-objective Ranking and Selection Using Stochastic Kriging [0.0]
We consider bi-objective ranking and selection problems in which the two objective outcomes have been observed with uncertainty.
We propose a novel Bayesian bi-objective ranking and selection method that sequentially allocates extra samples to competitive solutions.
Experimental results show that the proposed method outperforms the standard allocation method, as well as a well-known state-of-the-art algorithm.
arXiv Detail & Related papers (2022-09-05T23:51:07Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)