QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
- URL: http://arxiv.org/abs/2506.08123v1
- Date: Mon, 09 Jun 2025 18:24:57 GMT
- Title: QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
- Authors: Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou
- Abstract summary: We introduce QA-LIGN, an automatic symbolic reward decomposition approach. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability.
- Score: 49.9801383018588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.
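As a rough illustration of the decomposed-reward idea described in the abstract, the sketch below scores a response separately against each principle with one judge query per evaluation question. The principle wording, judge interface, and equal-weight aggregation are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of principle-decomposed reward scoring (not the paper's code).
from typing import Callable, Dict

# One evaluation question per constitutional principle (wording is assumed).
PRINCIPLE_QUESTIONS: Dict[str, str] = {
    "helpfulness": "Does the response directly and usefully address the request?",
    "honesty": "Are the factual claims in the response accurate?",
    "harmlessness": "Does the response avoid content that could cause harm?",
}

def decomposed_reward(prompt: str, response: str,
                      judge: Callable[[str], float]) -> Dict[str, float]:
    """Score a response against each principle separately, keeping the reward
    structure transparent instead of collapsing it into a single scalar."""
    rewards = {}
    for principle, question in PRINCIPLE_QUESTIONS.items():
        query = (f"Prompt: {prompt}\nResponse: {response}\n"
                 f"Question ({principle}): {question}\nAnswer yes or no.")
        rewards[principle] = judge(query)  # judge returns a score in [0, 1]
    return rewards

def aggregate(rewards: Dict[str, float]) -> float:
    """Equal-weight aggregation; weights could instead encode a constitution's priorities."""
    return sum(rewards.values()) / len(rewards)
```

Keeping the per-principle components separate is what makes the reward signal inspectable; the final aggregation step is the only place where they are combined.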
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment [39.965170904699974]
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. Current RLVR methods treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure.
arXiv Detail & Related papers (2025-08-04T11:06:08Z) - An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models [18.62332474172811]
Large Language Models (LLMs) have demonstrated remarkable progress in instruction following and general-purpose reasoning. High-quality alignment with human intent and safety norms without human annotations remains a fundamental challenge. We propose an Uncertainty-Driven Adaptive Self-Alignment framework designed to improve LLM alignment in a fully automated manner.
arXiv Detail & Related papers (2025-07-23T13:00:00Z) - Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why [50.191655141020505]
This survey provides a comparative analysis of feature-based and GAN-based approaches to learning from demonstrations. We argue that the dichotomy between feature-based and GAN-based methods is increasingly nuanced.
arXiv Detail & Related papers (2025-07-08T11:45:51Z) - Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards [1.1981384995161284]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z) - Latent Principle Discovery for Language Model Self-Improvement [14.137106102563514]
We propose eliciting latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval.
arXiv Detail & Related papers (2025-05-22T17:20:18Z) - Closing the Intent-to-Behavior Gap via Fulfillment Priority Logic [1.4542411354617986]
This paper presents the concept of objective fulfillment, upon which we build Fulfillment Priority Logic (FPL). Our novel Balanced Policy Gradient algorithm leverages FPL specifications to achieve up to 500% better sample efficiency compared to Soft Actor Critic.
arXiv Detail & Related papers (2025-03-04T18:45:20Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
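A minimal sketch of the reward-redistribution idea summarized above, assuming the reward model's attention mass over completion tokens is available as a vector; the interface and normalization are assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch: spread a sequence-level scalar reward over tokens using the
# reward model's attention weights (interface and normalization are assumptions).
import torch

def redistribute_reward(scalar_reward: float,
                        attention_weights: torch.Tensor) -> torch.Tensor:
    """attention_weights: shape (seq_len,), attention mass the reward model places
    on each completion token. Returns per-token rewards summing to the scalar."""
    weights = attention_weights / attention_weights.sum()
    return scalar_reward * weights

# Example: a 5-token completion that received a scalar reward of 0.8.
attn = torch.tensor([0.10, 0.30, 0.20, 0.25, 0.15])
dense = redistribute_reward(0.8, attn)
assert torch.isclose(dense.sum(), torch.tensor(0.8))
```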
arXiv Detail & Related papers (2024-02-01T17:10:35Z) - Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation [29.6763730290473]
Reinforcement learning can align language models with non-differentiable reward signals, such as human preferences.
This paper introduces a novel framework that utilizes the critique capability of Large Language Models to produce intermediate-step rewards.
arXiv Detail & Related papers (2024-01-14T22:05:11Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - From STL Rulebooks to Rewards [4.859570041295978]
We propose a principled approach to shaping rewards for reinforcement learning from multiple objectives.
We first equip STL with a novel quantitative semantics that allows individual requirements to be evaluated automatically.
We then develop a method for systematically combining evaluations of multiple requirements into a single reward.
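A hedged sketch of combining per-requirement evaluations into a single reward, in the spirit of the entry above; the priority-weighted aggregation shown here is an illustrative assumption, not the paper's STL-based construction.

```python
# Illustrative sketch of merging per-requirement scores into one reward
# (the priority weighting is an assumption, not the paper's STL semantics).
from typing import List

def combine_requirements(scores: List[float], priorities: List[int]) -> float:
    """scores: quantitative satisfaction of each requirement (higher is better).
    priorities: rank per requirement, 0 = most important; higher-priority
    requirements dominate through exponentially larger weights."""
    weights = [2.0 ** (max(priorities) - p) for p in priorities]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Example: a safety requirement (priority 0) outweighs a comfort requirement (priority 1).
reward = combine_requirements([0.9, 0.4], [0, 1])  # -> (2*0.9 + 1*0.4) / 3 ≈ 0.733
```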
arXiv Detail & Related papers (2021-10-06T14:16:59Z)