From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
- URL: http://arxiv.org/abs/2510.12689v1
- Date: Tue, 14 Oct 2025 16:24:19 GMT
- Title: From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
- Authors: Suyash Fulay, Jocelyn Zhu, Michiel Bakker
- Abstract summary: We study whether AI systems should act as delegates, mirroring expressed preferences, or as trustees. We find that trustee-style predictions weighted toward long-term interests produce policy decisions that align more closely with expert consensus on well-understood issues. These findings reveal a fundamental trade-off in designing AI systems to represent human interests.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown promising accuracy in predicting survey responses and policy preferences, which has increased interest in their potential to represent human interests in various domains. Most existing research has focused on behavioral cloning, effectively evaluating how well models reproduce individuals' expressed preferences. Drawing on theories of political representation, we highlight an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, exercising judgment about what best serves an individual's interests. This trade-off is closely related to the issue of LLM sycophancy, where models encourage behavior or validate beliefs that may align with a user's short-term preferences but are detrimental to their long-term interests. Through a series of experiments simulating votes on various policy issues in the U.S. context, we apply a temporal utility framework that weighs short- and long-term interests (simulating a trustee role) and compare voting outcomes to those of behavior-cloning models (simulating a delegate). We find that trustee-style predictions weighted toward long-term interests produce policy decisions that align more closely with expert consensus on well-understood issues, but also show greater bias toward models' default stances on topics lacking clear agreement. These findings reveal a fundamental trade-off in designing AI systems to represent human interests. Delegate models better preserve user autonomy but may diverge from well-supported policy positions, while trustee models can promote welfare on well-understood issues yet risk paternalism and bias on subjective topics.
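To make the temporal utility framework concrete, below is a minimal sketch of how a blended short/long-term utility could drive a simulated vote. The weighting parameter, utility values, and decision threshold are illustrative assumptions, not the paper's actual parameterization.

```python
# Minimal sketch of a temporal utility framework for simulated policy votes.
# lambda_long, the utility values, and the zero threshold are illustrative
# assumptions, not the paper's actual implementation.

def temporal_utility(u_short: float, u_long: float, lambda_long: float) -> float:
    """Blend short- and long-term utility for a policy.

    lambda_long = 0 recovers a pure delegate (expressed preference only);
    lambda_long = 1 recovers a pure trustee (long-term interest only).
    """
    return (1.0 - lambda_long) * u_short + lambda_long * u_long

def simulate_vote(u_short: float, u_long: float, lambda_long: float) -> str:
    """Vote 'yes' when the blended utility of the policy is positive."""
    return "yes" if temporal_utility(u_short, u_long, lambda_long) > 0 else "no"

# Example: a policy the voter dislikes now (u_short < 0) but that serves
# their long-term interests (u_long > 0).
for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}: {simulate_vote(u_short=-0.4, u_long=0.8, lambda_long=lam)}")
# lambda=0.0: no   (delegate mirrors the expressed preference)
# lambda=0.5: yes  (blended utility 0.2 > 0)
# lambda=1.0: yes  (trustee weights the long-term interest)
```

Sweeping the weight from 0 to 1 traces the delegate-to-trustee spectrum the paper studies; the vote flips once long-term interests dominate.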
Related papers
- Preference Learning for AI Alignment: a Causal Perspective [55.2480439325792]
We frame this problem in a causal paradigm, providing the rich toolbox of causality to identify persistent challenges. Inheriting from the literature on causal inference, we identify key assumptions necessary for reliable generalisation. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness.
arXiv Detail & Related papers (2025-06-06T10:45:42Z)
- What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation. It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals. It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z)
- Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models [6.9347404883379316]
Large Language Models (LLMs) are increasingly integrated into human life and exert growing influence on decision-making. It is crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. This paper presents the Preference, Opinion, and Belief survey (POBs) to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains.
arXiv Detail & Related papers (2025-05-26T07:41:21Z)
- Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude [8.959468665453286]
This study systematically evaluates how nine popular Large Language Models respond to ethical dilemmas involving protected attributes. Across 50,400 trials spanning single and intersectional attribute combinations, we assess models' ethical preferences, sensitivity, stability, and clustering patterns. Results reveal significant biases related to protected attributes in all models, with preferences differing by model type and dilemma context.
arXiv Detail & Related papers (2025-01-17T05:20:38Z)
- Beyond Partisan Leaning: A Comparative Analysis of Political Bias in Large Language Models [6.549047699071195]
This study adopts a persona-free, topic-specific approach to evaluate political behavior in large language models. We analyze responses from 43 large language models developed in the U.S., Europe, China, and the Middle East. Findings show most models lean center-left or left ideologically and vary in their nonpartisan engagement patterns.
arXiv Detail & Related papers (2024-12-21T19:42:40Z)
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are at odds with standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Biased AI can Influence Political Decision-Making [64.9461133083473]
This paper presents two experiments investigating the effects of partisan bias in large language models (LLMs) on political opinions and decision-making. We found that participants exposed to partisan-biased models were significantly more likely to adopt opinions and make decisions that matched the LLM's bias.
arXiv Detail & Related papers (2024-10-08T22:56:00Z)
- Long-Term Fairness in Sequential Multi-Agent Selection with Positive Reinforcement [21.44063458579184]
In selection processes such as college admissions or hiring, biasing slightly towards applicants from under-represented groups is hypothesized to provide positive feedback.
We propose the Multi-agent Fair-Greedy policy, which balances greedy score and fairness.
Our results indicate that, while positive reinforcement is a promising mechanism for long-term fairness, policies must be designed carefully to be robust to variations in the evolution model.
arXiv Detail & Related papers (2024-07-10T04:03:23Z)
- Whose Side Are You On? Investigating the Political Stance of Large Language Models [56.883423489203786]
We investigate the political orientation of Large Language Models (LLMs) across a spectrum of eight polarizing topics, spanning from abortion to LGBTQ issues.
The findings suggest that users should be mindful when crafting queries, and exercise caution in selecting neutral prompt language.
arXiv Detail & Related papers (2024-03-15T04:02:24Z)
- Joint Optimization of AI Fairness and Utility: A Human-Centered Approach [45.04980664450894]
We argue that, because different fairness criteria cannot always be satisfied simultaneously, it is key to elicit and adhere to human policy makers' preferences on how to trade off these objectives.
We propose a framework and some exemplar methods for eliciting such preferences and for optimizing an AI model according to these preferences.
arXiv Detail & Related papers (2020-02-05T03:31:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.