Fine-tuning language models to find agreement among humans with diverse
preferences
- URL: http://arxiv.org/abs/2211.15006v1
- Date: Mon, 28 Nov 2022 02:24:14 GMT
- Title: Fine-tuning language models to find agreement among humans with diverse
preferences
- Authors: Michiel A. Bakker and Martin J. Chadwick and Hannah R. Sheahan and
Michael Henry Tessler and Lucy Campbell-Gillingham and Jan Balaguer and Nat
McAleese and Amelia Glaese and John Aslanides and Matthew M. Botvinick and
Christopher Summerfield
- Abstract summary: Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user.
Here, we consider how a machine might help people with diverse views find agreement.
We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions.
We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent.
- Score: 7.702628192754256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work in large language modeling (LLMs) has used fine-tuning to align
outputs with the preferences of a prototypical user. This work assumes that
human preferences are static and homogeneous across individuals, so that
aligning to a single "generic" user will confer more general alignment. Here,
we embrace the heterogeneity of human preferences to consider a different
challenge: how might a machine help people with diverse views find agreement?
We fine-tune a 70 billion parameter LLM to generate statements that maximize
the expected approval for a group of people with potentially diverse opinions.
Human participants provide written opinions on thousands of questions touching
on moral and political issues (e.g., "should we raise taxes on the rich?"), and
rate the LLM's generated candidate consensus statements for agreement and
quality. A reward model is then trained to predict individual preferences,
enabling it to quantify and rank consensus statements in terms of their appeal
to the overall group, defined according to different aggregation (social
welfare) functions. The model produces consensus statements that are preferred
by human users over those from prompted LLMs (>70%) and significantly
outperforms a tight fine-tuned baseline that lacks the final ranking step.
Further, our best model's consensus statements are preferred over the best
human-generated opinions (>65%). We find that when we silently constructed
consensus statements from only a subset of group members, those who were
excluded were more likely to dissent, revealing the sensitivity of the
consensus to individual contributions. These results highlight the potential to
use LLMs to help groups of humans align their values with one another.
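As a minimal illustration of the ranking step described above (an assumption-laden sketch, not the authors' released code): candidate consensus statements sampled from the fine-tuned LLM are scored by a reward model that predicts each member's approval, and the per-member scores are combined with a social welfare function before the top statement is selected. The `score` callable below is a hypothetical stand-in for the paper's reward model.

```python
# Illustrative sketch of reward-model ranking under different social welfare
# functions; `score(opinion, statement)` is a hypothetical stand-in for a
# reward model that predicts one member's approval of a candidate statement.
from typing import Callable, List

def rank_candidates(
    opinions: List[str],                 # one written opinion per group member
    candidates: List[str],               # consensus statements sampled from the LLM
    score: Callable[[str, str], float],  # predicted approval of a statement by a member
    welfare: str = "utilitarian",        # aggregation (social welfare) function
) -> List[str]:
    def group_welfare(statement: str) -> float:
        per_member = [score(opinion, statement) for opinion in opinions]
        if welfare == "utilitarian":     # maximize mean predicted approval
            return sum(per_member) / len(per_member)
        if welfare == "egalitarian":     # maximize the worst-off member's approval
            return min(per_member)
        raise ValueError(f"unknown welfare function: {welfare}")

    # Highest group welfare first; the top statement is shown to participants.
    return sorted(candidates, key=group_welfare, reverse=True)
```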
Related papers
- ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Intuitions of Compromise: Utilitarianism vs. Contractualism [42.3322948655612]
We use a paradigm that applies algorithms to aggregating preferences across groups in a social decision-making context.
While the dominant approach to value aggregation up to now has been utilitarian, we find that people strongly prefer the aggregations recommended by the contractualist algorithm.
arXiv Detail & Related papers (2024-10-07T21:05:57Z)
- Evaluating Large Language Model Biases in Persona-Steered Generation [26.92498998306013]
We show that large language models (LLMs) are 9.7% less steerable towards incongruous personas than congruous ones.
Models that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women.
arXiv Detail & Related papers (2024-05-30T17:06:03Z)
- Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences (a minimal max-min sketch appears after this list).
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented.
Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties [68.66719970507273]
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
arXiv Detail & Related papers (2023-09-02T01:24:59Z)
- Aligning Language Models to User Opinions [10.953326025836475]
We find that the opinions of a user and their demographics and ideologies are not mutual predictors.
We use this insight to align LLMs by modeling both user opinions as well as user demographics and ideology.
In addition to the typical approach of prompting LLMs with demographics and ideology, we discover that utilizing the most relevant past opinions from individual users enables the model to predict user opinions more accurately.
arXiv Detail & Related papers (2023-05-24T09:11:11Z)
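The MaxMin-RLHF entry above turns on an egalitarian (max-min) aggregation of per-group rewards. Below is a minimal, hypothetical sketch of that idea, not the authors' released code: it assumes reward models learned per latent user group (e.g., recovered by an expectation-maximization fit over preference data) and scores a policy's sampled responses by its least-satisfied group.

```python
# Hypothetical sketch of a max-min (egalitarian) alignment objective.
# `group_reward_models` stands in for reward models learned per latent user
# group, e.g., via an expectation-maximization fit over preference data.
from typing import Callable, List, Sequence

def maxmin_objective(
    responses: Sequence[str],                            # responses sampled from the current policy
    group_reward_models: List[Callable[[str], float]],   # one learned reward model per group
) -> float:
    # Expected reward of the policy under each group's reward model.
    per_group = [
        sum(rm(r) for r in responses) / len(responses)
        for rm in group_reward_models
    ]
    # Max-min: the policy scores only as well as its least-satisfied group,
    # so optimizing this value pushes alignment toward equitable coverage.
    return min(per_group)
```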