Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time
- URL: http://arxiv.org/abs/2505.23729v2
- Date: Sat, 31 May 2025 23:47:06 GMT
- Title: Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time
- Authors: Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi
- Abstract summary: We propose SITAlign, an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach.
- Score: 52.230936493691985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision-making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
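To make the satisficing recipe concrete, here is a minimal, hypothetical sketch of threshold-filtered candidate selection, not the authors' SITAlign implementation: among sampled responses, keep those whose secondary reward (e.g., harmlessness) clears a threshold, then return the one with the highest primary reward (e.g., helpfulness). The function names, reward scorers, and fallback rule are illustrative assumptions.

```python
# Hypothetical sketch of satisficing selection at inference time; reward
# scorers, names, and the fallback rule are illustrative assumptions, not
# the paper's SITAlign algorithm.
from typing import Callable, List, Optional

def satisficing_select(
    candidates: List[str],
    primary_reward: Callable[[str], float],    # e.g., a helpfulness scorer
    secondary_reward: Callable[[str], float],  # e.g., a harmlessness scorer
    threshold: float,                          # acceptable level for the secondary criterion
) -> Optional[str]:
    """Return the candidate with the highest primary reward among those
    whose secondary reward meets the satisficing threshold."""
    if not candidates:
        return None
    # Satisfice: keep only candidates that clear the secondary threshold.
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if not feasible:
        # Fallback when nothing is feasible: pick the candidate closest to
        # satisfying the secondary constraint.
        return max(candidates, key=secondary_reward)
    # Maximize the primary objective over the feasible set.
    return max(feasible, key=primary_reward)
```

A full inference-time method would apply this maximize-subject-to-threshold structure per decoding step or over sampled continuations using learned reward models for each criterion; the sketch above only illustrates the underlying selection rule.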
Related papers
- Risk-Averse Best Arm Set Identification with Fixed Budget and Fixed Confidence [0.562479170374811]
We introduce a novel problem setting in bandit optimization that addresses maximizing expected reward and minimizing associated uncertainty. We propose a unified meta-algorithmic framework capable of operating under both fixed-confidence and fixed-budget regimes. Our approach outperforms existing methods in terms of both accuracy and sample efficiency.
arXiv Detail & Related papers (2025-06-27T14:21:03Z) - Robust Satisficing Gaussian Process Bandits Under Adversarial Attacks [7.701333337093469]
We consider a robust satisficing objective, where the goal is to consistently achieve a predefined performance threshold $\tau$, even under adversarial conditions. We propose two novel algorithms based on distinct formulations of robust satisficing, and show that they are instances of a general robust satisficing framework. Specifically, we derive two regret bounds: one that is sublinear over time, assuming certain conditions on the adversary and the satisficing threshold $\tau$, and another that scales with the perturbation magnitude but requires no assumptions on the adversary.
arXiv Detail & Related papers (2025-06-02T13:04:18Z) - Probability-Consistent Preference Optimization for Enhanced LLM Reasoning [36.74546743563837]
We propose a novel framework that establishes dual quantitative metrics for preference selection. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
arXiv Detail & Related papers (2025-05-29T15:20:44Z) - Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning [0.0]
We introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors.
arXiv Detail & Related papers (2025-05-17T06:09:13Z) - Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning [16.10753846850319]
Continual learning allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. We introduce a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations.
arXiv Detail & Related papers (2025-01-21T13:33:45Z) - Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning [17.245293915129942]
The optimal objective is a fundamental aspect of reinforcement learning (RL). While total return is ideal, discounted return is a practical objective due to its stability. We propose two alternative approaches to align the objectives.
arXiv Detail & Related papers (2024-07-18T08:33:10Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Bi-objective Ranking and Selection Using Stochastic Kriging [0.0]
We consider bi-objective ranking and selection problems in which the two objective outcomes have been observed with uncertainty.
We propose a novel Bayesian bi-objective ranking and selection method that sequentially allocates extra samples to competitive solutions.
Experimental results show that the proposed method outperforms the standard allocation method, as well as a well-known state-of-the-art algorithm.
arXiv Detail & Related papers (2022-09-05T23:51:07Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)