What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
- URL: http://arxiv.org/abs/2510.26202v1
- Date: Thu, 30 Oct 2025 07:25:10 GMT
- Title: What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
- Authors: Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
- Abstract summary: What's In My Human Feedback? (WIMHF) is a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express.
- Score: 20.75601428185122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
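The abstract describes a two-stage idea: learn interpretable features of response pairs with a sparse autoencoder (SAE), then see which features carry the preference-prediction signal. Below is a minimal, hypothetical Python sketch of such a pipeline; the class names, hyperparameters, and the L1-probe design are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical two-stage sketch (not the authors' code):
# stage 1 trains a sparse autoencoder (SAE) on response embeddings;
# stage 2 fits an L1 probe on activation differences, so the few surviving
# features describe what annotators actually reward.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int, d_feats: int):
        super().__init__()
        self.enc = nn.Linear(d_embed, d_feats)
        self.dec = nn.Linear(d_feats, d_embed)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # non-negative feature activations
        return self.dec(z), z

def train_sae(embeds, d_feats=512, l1=1e-3, epochs=50, lr=1e-3):
    """embeds: (n, d) tensor from any sentence encoder."""
    sae = SparseAutoencoder(embeds.shape[1], d_feats)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, z = sae(embeds)
        loss = ((recon - embeds) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

def feature_probe(sae, emb_a, emb_b, labels, l1=1.0):
    """labels[i] = 1 if response A won comparison i (randomize which side
    is A so both classes occur). Returns per-feature preference weights."""
    with torch.no_grad():
        _, z_a = sae(emb_a)
        _, z_b = sae(emb_b)
    diff = (z_a - z_b).numpy()
    # The L1 penalty keeps only a small set of interpretable features.
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / l1)
    probe.fit(diff, labels)
    return probe.coef_[0]
```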
Related papers
- Towards Understanding Valuable Preference Data for Large Language Model Alignment [85.38864561060088]
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons. We assess data quality through each example's individual influence on validation data, using our newly proposed truncated influence function (TIF). We then combine these signals to offset their diverse error sources, resulting in a simple yet effective data selection rule.
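For intuition on influence-based data valuation, a common first-order approximation scores each training example by how well its gradient aligns with the validation gradient; the paper's truncated influence function is defined precisely there and differs in detail. The helper below is a hedged sketch under that approximation.

```python
# Hedged sketch: first-order influence of a training example, approximated
# as the dot product of its gradient with the validation-loss gradient
# (the inverse-Hessian term of classic influence functions is dropped).
import torch

def influence_scores(model, loss_fn, train_examples, val_batch):
    """loss_fn(model, batch) -> scalar loss. A positive score suggests the
    example pushes the model in the same direction as the validation set."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(loss):
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    g_val = flat_grad(loss_fn(model, val_batch))
    scores = []
    for example in train_examples:
        g_i = flat_grad(loss_fn(model, example))
        scores.append(torch.dot(g_i, g_val).item())
    return scores  # rank and keep high-influence preference pairs
```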
arXiv Detail & Related papers (2025-10-15T06:57:55Z)
- Policy Teaching via Data Poisoning in Learning from Human Preferences [24.645259298082436]
We study data poisoning attacks in learning from human preferences, focusing on the problem of teaching/enforcing a target policy $\pi^\dagger$ by synthesizing preference data.
arXiv Detail & Related papers (2025-03-13T10:11:54Z)
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs) for extracting diverse human preferences from binary comparisons. DRMs represent preferences as vectors and analyze them using Principal Component Analysis (PCA). DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
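A minimal sketch of the PCA view described above (function names are illustrative, not the authors' API): treat each comparison as an embedding difference, read candidate preference axes off its principal components, and fit per-user weights over those fixed axes so new users need no retraining.

```python
# Illustrative sketch of decomposing preferences via PCA.
import numpy as np

def preference_components(chosen_emb, rejected_emb, k=5):
    """chosen_emb, rejected_emb: (n, d) arrays from any text encoder."""
    diffs = chosen_emb - rejected_emb          # one vector per comparison
    diffs = diffs - diffs.mean(axis=0)
    # Principal components serve as candidate preference axes (e.g.,
    # helpfulness, safety, humor), labeled by inspecting extreme examples.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                              # (k, d) direction vectors

def user_weights(components, emb_a, emb_b, labels):
    """Fit one user's mixing weights over the fixed components from that
    user's comparisons; labels[i] = 1 if response A won (randomize sides)."""
    x = (emb_a - emb_b) @ components.T   # (n, k) projections
    y = 2 * np.asarray(labels) - 1       # map {0, 1} -> {-1, +1}
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w
```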
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce HyPER, a Hybrid Preference routER that defers an annotation to either humans or language models (LMs). The hybrid mixture of synthetic and direct human preferences selected by HyPER improves RM performance on RewardBench by 7-13% over using either source exclusively. We also analyze features from HyPER and find that prompts with moderate safety concerns or complexity benefit the most from human feedback.
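A toy router in this spirit, assuming hand-built prompt features and a pilot set where we already know whether the human label helped; the actual HyPER features and budgeting rule may differ.

```python
# Toy routing sketch (features and budget rule are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_router(prompt_feats, human_helped):
    """Pilot data: human_helped[i] = 1 if a human label on prompt i
    improved downstream RM performance relative to an LM label."""
    return LogisticRegression().fit(prompt_feats, human_helped)

def route(router, prompt_feats, human_budget=0.25):
    """Send the top fraction of prompts (by predicted benefit) to human
    annotators; label the rest with an LM judge."""
    benefit = router.predict_proba(prompt_feats)[:, 1]
    cutoff = np.quantile(benefit, 1.0 - human_budget)
    return benefit >= cutoff   # True -> human, False -> LM
```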
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- LRHP: Learning Representations for Human Preferences via Preference Pairs [45.056558199304554]
We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences.
We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
arXiv Detail & Related papers (2024-10-06T14:48:28Z)
- Data-Centric Human Preference with Rationales for Direct Preference Alignment [23.243583332894737]
We propose augmenting standard preference pairs with rationales that explain the reasoning behind the human preference. Our comprehensive analysis demonstrates that incorporating rationales improves learning efficiency. Our findings showcase the potential of thoughtful data design in preference learning.
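As a concrete picture of what an augmented pair might look like (field names are hypothetical, not the paper's schema):

```python
# Illustrative rationale-augmented preference record.
example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter (blue) "
              "wavelengths scatter the most, so the sky looks blue.",
    "rejected": "The sky reflects the ocean.",
    "rationale": "The chosen answer gives the correct mechanism (Rayleigh "
                 "scattering); the rejected one repeats a common myth.",
}
# The rationale can condition the training loss or be used to filter noisy
# pairs, which is one route to the reported gains in learning efficiency.
```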
arXiv Detail & Related papers (2024-07-19T17:27:52Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences. Our key idea is leveraging the human prior knowledge within the small (seed) data. We introduce a noise-aware preference learning algorithm to mitigate the risk of low-quality labels within the generated preference data.
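One standard way to make preference learning noise-aware is to label-smooth a DPO-style objective under an assumed label-flip rate; the sketch below shows that idea and may differ from the paper's exact algorithm.

```python
# Label-smoothed, DPO-style preference loss under an assumed flip rate eps.
import torch
import torch.nn.functional as F

def noise_aware_loss(logratio_chosen, logratio_rejected, beta=0.1, eps=0.1):
    """logratio_* = log pi_theta(y|x) - log pi_ref(y|x) per response.
    With probability eps the generated label is assumed to be flipped."""
    margin = beta * (logratio_chosen - logratio_rejected)
    # Standard DPO is -log sigmoid(margin); mix in the flipped term to
    # avoid overfitting to mislabeled pairs.
    return -((1 - eps) * F.logsigmoid(margin)
             + eps * F.logsigmoid(-margin)).mean()
```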
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Inverse Constitutional AI: Compressing Preferences into Principles [37.28372419588119]
In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. We introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. We propose a corresponding ICAI algorithm and validate its generated constitutions on several datasets.
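As a toy instance of the compression framing (the actual ICAI algorithm is LLM-driven and more involved): greedily pick the few candidate principles whose combined verdicts best reconstruct the observed labels.

```python
# Toy compression sketch: greedily select principles whose verdicts best
# reconstruct the preference labels.
import numpy as np

def greedy_constitution(verdicts, labels, k=5):
    """verdicts: (n_pairs, n_principles) in {-1, 0, +1}, each candidate
    principle's judgment per pair; labels: (n_pairs,) in {-1, +1}."""
    chosen, remaining = [], list(range(verdicts.shape[1]))
    for _ in range(k):
        def accuracy(j):
            votes = verdicts[:, chosen + [j]].sum(axis=1)
            return np.mean(np.sign(votes) == labels)  # ties count as misses
        best = max(remaining, key=accuracy)
        chosen.append(best)
        remaining.remove(best)
    return chosen  # indices of the selected principles (the "constitution")
```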
arXiv Detail & Related papers (2024-06-02T11:54:50Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
By contrast, advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
- Personalized Language Modeling from Personalized Human Feedback [45.16986573937782]
Personalized large language models (LLMs) are designed to tailor responses to individual user preferences. We propose Personalized-RLHF (P-RLHF), an efficient framework that utilizes a lightweight user model to capture individual user preferences. We show that personalized LLMs trained using P-RLHF generate responses that are more closely aligned with individual user preferences.
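A minimal sketch of what a "lightweight user model" could look like (this architecture is an assumption, not the paper's): a learned per-user embedding conditions the reward head, so the same response can score differently for different users.

```python
# Hypothetical architecture: a per-user embedding conditions the reward head.
import torch
import torch.nn as nn

class PersonalizedRewardHead(nn.Module):
    def __init__(self, d_resp: int, n_users: int, d_user: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d_user)  # the "user model"
        self.head = nn.Linear(d_resp + d_user, 1)

    def forward(self, resp_repr, user_ids):
        u = self.user_emb(user_ids)                    # (batch, d_user)
        return self.head(torch.cat([resp_repr, u], dim=-1)).squeeze(-1)

# Trained with a Bradley-Terry loss on (chosen, rejected, user) triples,
# the same response can receive different rewards for different users.
```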
arXiv Detail & Related papers (2024-02-06T04:18:58Z)
- Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose the contexts at which to obtain human feedback in order to identify a good policy most efficiently. We propose an active exploration algorithm to efficiently select the data, and we prove that it satisfies a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
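For intuition, an ensemble-disagreement heuristic captures the idea of querying feedback where it is most informative; the paper's algorithm and its worst-case regret analysis are specific and differ from this sketch.

```python
# Ensemble-disagreement heuristic for choosing where to ask for feedback.
import numpy as np

def select_contexts(reward_ensemble, contexts, pairs, budget):
    """reward_ensemble: list of callables (context, response) -> score.
    pairs[i]: the (response_a, response_b) candidates for contexts[i]."""
    disagreement = []
    for ctx, (a, b) in zip(contexts, pairs):
        margins = [r(ctx, a) - r(ctx, b) for r in reward_ensemble]
        disagreement.append(np.std(margins))  # high std = uncertain pair
    order = np.argsort(disagreement)[::-1]
    return order[:budget]  # indices to send to human annotators
```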
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)