VAL-Bench: Measuring Value Alignment in Language Models
- URL: http://arxiv.org/abs/2510.05465v2
- Date: Wed, 08 Oct 2025 01:35:03 GMT
- Title: VAL-Bench: Measuring Value Alignment in Language Models
- Authors: Aman Gupta, Denny O'Shea, Fazl Barez
- Abstract summary: Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions. Existing benchmarks mostly track refusals or predefined safety violations but do not reveal whether a model upholds a coherent value system. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates.
- Score: 10.745372809345412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined safety violations, but these only check rule compliance and do not reveal whether a model upholds a coherent value system when facing controversial real-world issues. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates. VAL-Bench consists of 115K such pairs from Wikipedia's controversial sections. A well-aligned model should express similar underlying views regardless of framing, which we measure using an LLM-as-judge to score agreement or divergence between paired responses. Applied across leading open- and closed-source models, the benchmark reveals large variation in alignment and highlights trade-offs between safety strategies (e.g., refusals) and more expressive value systems. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic comparison of how reliably LLMs embody human values.
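The evaluation loop the abstract describes (query the model with both framings of an issue, then use an LLM-as-judge to score agreement between the paired responses) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `query_model` and `judge_agreement` are hypothetical callables standing in for a model API and a judge prompt, and VAL-Bench's actual prompts, rubric, and aggregation may differ.

```python
# Minimal sketch of the paired-prompt evaluation loop from the abstract.
# `query_model` and `judge_agreement` are hypothetical stand-ins for a model
# API call and an LLM-as-judge call; the benchmark's real protocol may differ.
from statistics import mean
from typing import Callable

def alignment_score(
    pairs: list[tuple[str, str]],                   # (pro-framing prompt, con-framing prompt)
    query_model: Callable[[str], str],              # model under test
    judge_agreement: Callable[[str, str], float],   # 1.0 = same stance, 0.0 = opposite stance
) -> float:
    """Average agreement between responses to oppositely framed prompts."""
    scores = []
    for prompt_a, prompt_b in pairs:
        response_a = query_model(prompt_a)
        response_b = query_model(prompt_b)
        # A well-aligned model should express the same underlying view
        # regardless of framing, so higher agreement means better alignment.
        scores.append(judge_agreement(response_a, response_b))
    return mean(scores)
```

Note that in such a sketch a model that refuses both prompts of a pair would look trivially consistent; how the judge treats refusals is one place where the trade-off the abstract mentions between refusal-based safety strategies and more expressive value systems could show up.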
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models [55.94503936470247]
Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLM judges. Most classical methods assume annotators are conditionally independent given the true label $Y \in \{0, 1\}$, an assumption often violated by LLM judges. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors.
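As a quick illustration of why the conditional-independence assumption matters (a toy simulation, not the paper's Ising-model method): when judge errors share a common cause, majority voting over $K$ judges loses most of its advantage over a single judge.

```python
# Toy simulation (illustrative only): correlated judge errors erode the gains
# that majority voting would enjoy under conditional independence.
import random

def simulate(k_judges=5, n_items=20000, p_correct=0.7, rho=0.0, seed=0):
    """Fraction of items where the majority vote over k judges is correct.

    rho is the probability that all judges copy one shared draw instead of
    judging independently -- a crude stand-in for shared-bias dependence.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        if rng.random() < rho:   # shared latent factor: one draw for all judges
            votes = [rng.random() < p_correct] * k_judges
        else:                    # conditionally independent judges
            votes = [rng.random() < p_correct for _ in range(k_judges)]
        correct += sum(votes) > k_judges / 2
    return correct / n_items

print(simulate(rho=0.0))   # ~0.84: independence lets 5 judges beat a single judge (0.7)
print(simulate(rho=0.8))   # ~0.73: strong dependence erases most of that gain
```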
arXiv Detail & Related papers (2026-01-29T21:26:50Z) - Uncovering Competency Gaps in Large Language Models and Their Benchmarks [11.572508874955659]
We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. We found that models consistently underperformed on concepts that stand in contrast to sycophantic behaviors. Our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores.
arXiv Detail & Related papers (2025-12-06T17:39:47Z) - RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions [0.0]
Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions. We introduce RoParQ, a benchmark to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. We also propose XParaCon, a novel evaluation metric that quantifies a model's robustness.
arXiv Detail & Related papers (2025-11-26T16:40:53Z) - BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses [32.58830706120845]
Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance. We introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
arXiv Detail & Related papers (2025-09-30T19:56:54Z) - EigenBench: A Comparative Behavioral Measure of Value Alignment [0.28707625120094377]
EigenBench is a black-box method for benchmarking language models' values. It is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. It can recover model rankings on the GPQA benchmark without access to objective labels.
arXiv Detail & Related papers (2025-09-02T04:14:26Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items [2.9357382494347264]
Existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. We propose the Value Portrait benchmark, which consists of items that capture real-life user-LLM interactions. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values.
arXiv Detail & Related papers (2025-05-02T05:26:50Z) - PairBench: Are Vision-Language Models Reliable at Comparing What They See? [16.49586486795478]
We present PairBench, a framework to evaluate how reliably large vision-language models (VLMs) can serve as automatic evaluators depending on the task. Our approach introduces four key metrics for reliable comparison: alignment with human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses.
arXiv Detail & Related papers (2025-02-21T04:53:11Z) - Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) have achieved remarkable breakthroughs, and aligning their values with humans has become imperative for their responsible development. However, evaluations of LLMs' values that fulfill three desirable goals are still lacking.
arXiv Detail & Related papers (2025-01-13T05:53:56Z) - Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. They can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation.
It is crucial to correctly quantify their uncertainty in responding to given inputs.
We develop a novel and practical framework, termed Rank-Calibration, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.