Related papers: Inverse Constitutional AI: Compressing Preferences into Principles

Inverse Constitutional AI: Compressing Preferences into Principles

URL: http://arxiv.org/abs/2406.06560v1
Date: Sun, 2 Jun 2024 11:54:50 GMT
Title: Inverse Constitutional AI: Compressing Preferences into Principles
Authors: Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins,
Abstract summary: We look at the Inverse Constitutional AI (ICAI) problem. In ICAI, a set of principles is used to provide feedback and fine-tune AI models. We propose an initial ICAI algorithm and validate its generated constitutions.
Score: 37.28372419588119
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Often pairwise text preferences are used: given two texts, human (or AI) annotators select the "better" one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback), or to rank models according to human preferences (e.g., Chatbot Arena). Despite its wide-spread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in a manner hard to identify. In this paper, we formulate the interpretation of existing pairwise text preference data as a compression task: the Inverse Constitutional AI (ICAI) problem. In constitutional AI, a set of principles (or constitution) is used to provide feedback and fine-tune AI models. The ICAI problem inverts this process: given a dataset of feedback, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding initial ICAI algorithm and validate its generated constitutions quantitatively based on reconstructed annotations. Generated constitutions have many potential use-cases -- they may help identify undesirable biases, scale feedback to unseen data or assist with adapting LLMs to individual user preferences. We demonstrate our approach on a variety of datasets: (a) synthetic feedback datasets with known underlying principles; (b) the AlpacaEval dataset of cross-annotated human feedback; and (c) the crowdsourced Chatbot Arena data set. We release the code for our algorithm and experiments at https://github.com/rdnfn/icai .

Related papers

Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA) DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z)
How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop a suite of selectors to get the most informative datapoints for human evaluation. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z)
Reward Modeling with Ordinal Feedback: Wisdom of the Crowd [9.034189257088762]
Learning a reward model (RM) from human preferences has been an important component in aligning large language models. We propose a framework for learning RMs under ordinal feedback. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity.
arXiv Detail & Related papers (2024-11-19T20:17:04Z)
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality. We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations. We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models. We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback. Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference. Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies. We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z)
From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases. This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
arXiv Detail & Related papers (2024-01-24T04:57:32Z)
RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE) It encodes the text corpus into a latent space, capturing current and future information from both source and target text. Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present textscUltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences [1.8592384822257952]
We propose a novel topic modeling and inference algorithm. We leverage pre-trained sentence embeddings by combining generative process models and clustering. TheTailor evaluation shows that our method yields state-of-the art results with relatively little computational demands.
arXiv Detail & Related papers (2023-02-06T20:13:11Z)
Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences. The distilled information from the classifier is then used to train a reliable sentence embedding model. Our model trained on synthetic data generalizes well and outperforms the existing baselines.
arXiv Detail & Related papers (2022-08-29T05:42:22Z)
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models. We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.