Related papers: VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

URL: http://arxiv.org/abs/2602.03160v1
Date: Tue, 03 Feb 2026 06:19:57 GMT
Title: VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models
Authors: Woojin Kim, Sieun Hyeon, Jusang Oh, Jaeyoung Do
Abstract summary: VALUEFLOW is a framework that spans extraction, evaluation, and steering with calibrated intensity control. We conduct a large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control.
Score: 9.511622126333105
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

Related papers

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities [75.10343190811592]
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains. Our benchmark offers a principled and interpretable framework for safe and controllable behavior.
arXiv Detail & Related papers (2026-03-03T03:50:13Z)
LIBERO-X: Robustness Litmus for Vision-Language-Action Models [32.29541801424534]
This work systematically rethinks VLA benchmarking from both evaluation and data perspectives. We introduce LIBERO-X, a benchmark featuring a hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations.
arXiv Detail & Related papers (2026-02-06T09:59:12Z)
DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training [94.568675548967]
Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain generalization. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. We propose DFPO, a robust distributional RL framework that models values as continuous flows across time steps.
arXiv Detail & Related papers (2026-02-05T17:07:42Z)
Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z)
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) achieve remarkable breakthroughs. Aligning their values with humans has become imperative for their responsible development. There is still a lack of evaluations of LLMs' values that fulfill three desirable goals.
arXiv Detail & Related papers (2025-01-13T05:53:56Z)
CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses [34.77031649891843]
We introduce CLAVE, a novel framework which integrates two complementary Large Language Models (LLMs). This dual-model approach enables calibration with any value system using 100 human-labeled samples per value type. We present ValEval, a comprehensive dataset comprising 13k+ (text, value, label) tuples across diverse domains, covering three major value systems.
arXiv Detail & Related papers (2024-07-15T13:51:37Z)
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preference. HD-Eval inherits the essence of the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators. Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.