Inverse Constitutional AI: Compressing Preferences into Principles
- URL: http://arxiv.org/abs/2406.06560v2
- Date: Mon, 21 Apr 2025 15:37:15 GMT
- Title: Inverse Constitutional AI: Compressing Preferences into Principles
- Authors: Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins
- Abstract summary: We introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. We propose a corresponding ICAI algorithm and validate its generated constitutions on several datasets.
- Score: 37.28372419588119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feedback data is widely used for fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences, where human or AI annotators select the "better" of two options, are particularly common. Such preferences are used to train (reward) models or to rank models with aggregate statistics. For many applications it is desirable to understand annotator preferences in addition to modelling them - not least because extensive prior work has shown various unintended biases in preference datasets. Yet, preference datasets remain challenging to interpret. Neither black-box reward models nor statistics can answer why one text is preferred over another. Manual interpretation of the numerous (long) response pairs is usually equally infeasible. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding ICAI algorithm and validate its generated constitutions quantitatively based on annotation reconstruction accuracy on several datasets: (a) synthetic feedback data with known principles; (b) AlpacaEval cross-annotated human feedback data; (c) crowdsourced Chatbot Arena data; and (d) PRISM data from diverse demographic groups. As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases: help identify undesirable annotator biases, understand model performance better, scale feedback to unseen data, or adapt models to individual user or group preferences. We release the source code at https://github.com/rdnfn/icai.
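The ICAI objective described above can be illustrated with a minimal sketch: given candidate constitutions, keep the one that lets a principle-guided LLM judge best reconstruct the original pairwise labels. The `judge` callable below is a hypothetical stand-in for an actual LLM call, not the authors' implementation; the released code at https://github.com/rdnfn/icai contains the real algorithm.

```python
# Minimal sketch of the ICAI objective: choose the constitution (a short
# list of principles) that best lets an LLM judge reconstruct the original
# pairwise preference annotations. `judge` is a hypothetical stand-in for
# an LLM call and is expected to return "a" or "b".
from typing import Callable, Iterable, Sequence

# Each datapoint: (response_a, response_b, human_label), label is "a" or "b".
Preference = tuple[str, str, str]
JudgeFn = Callable[[str, str, Sequence[str]], str]


def reconstruction_accuracy(
    constitution: Sequence[str],
    data: Iterable[Preference],
    judge: JudgeFn,
) -> float:
    """Fraction of original annotations the principle-guided judge reproduces."""
    rows = list(data)
    if not rows:
        return 0.0
    hits = sum(judge(resp_a, resp_b, constitution) == label for resp_a, resp_b, label in rows)
    return hits / len(rows)


def select_constitution(
    candidates: Sequence[Sequence[str]],
    data: Iterable[Preference],
    judge: JudgeFn,
) -> Sequence[str]:
    """Pick the candidate constitution with the highest reconstruction accuracy."""
    rows = list(data)
    return max(candidates, key=lambda c: reconstruction_accuracy(c, rows, judge))
```

In the paper's full pipeline the candidate principles are themselves proposed by an LLM from the feedback data before a selection step of this kind; the sketch only illustrates the evaluation-by-reconstruction objective.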
Related papers
- Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons.
Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA).
DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training; a rough PCA sketch in this spirit appears after this list.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop a suite of selectors to get the most informative datapoints for human evaluation.
We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection.
In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z)
- Reward Modeling with Ordinal Feedback: Wisdom of the Crowd [9.034189257088762]
Learning a reward model (RM) from human preferences has been an important component in aligning large language models.
We propose a framework for learning RMs under ordinal feedback.
We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity.
arXiv Detail & Related papers (2024-11-19T20:17:04Z)
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework for modeling human preferences that complements existing pretraining strategies.
We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z)
- From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning.
Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases.
This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
arXiv Detail & Related papers (2024-01-24T04:57:32Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences [1.8592384822257952]
We propose a novel topic modeling and inference algorithm.
We leverage pre-trained sentence embeddings by combining generative process models and clustering.
The evaluation shows that our method yields state-of-the-art results with relatively low computational demands.
arXiv Detail & Related papers (2023-02-06T20:13:11Z) - Reweighting Strategy based on Synthetic Data Identification for Sentence
Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences.
The distilled information from the classifier is then used to train a reliable sentence embedding model.
Our model trained on synthetic data generalizes well and outperforms the existing baselines.
arXiv Detail & Related papers (2022-08-29T05:42:22Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
