Related papers: EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

URL: http://arxiv.org/abs/2510.05942v2
Date: Wed, 08 Oct 2025 08:03:38 GMT
Title: EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
Authors: Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri
Abstract summary: EvalMORAAL is a transparent chain-of-thought framework to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics).
Score: 1.141545154221656
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
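The alignment figures in the abstract are Pearson correlations between model-derived scores and survey scores, and the regional gap is the difference between the average r for Western and non-Western regions. The following is a minimal sketch of that arithmetic only, not the authors' implementation; the function names `pearson_r` and `regional_gap`, the region labels, and the input layout are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def regional_gap(r_by_region, western):
    """Mean r over Western regions minus mean r over all other regions.

    r_by_region: dict mapping a (hypothetical) region label to its r value.
    western: set of labels treated as Western; this grouping is an assumption.
    """
    west = [r for region, r in r_by_region.items() if region in western]
    other = [r for region, r in r_by_region.items() if region not in western]
    if not west or not other:
        raise ValueError("both groups must be non-empty")
    return sum(west) / len(west) - sum(other) / len(other)
```

With the two regional averages reported in the abstract (0.82 and 0.61), `regional_gap` returns the stated absolute gap of 0.21.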

Related papers

The Global Representativeness Index: A Total Variation Distance Framework for Measuring Demographic Fidelity in Survey Research [0.0]
Survey research increasingly informs high-stakes decisions in AI governance and cross-cultural policy. No standardized metric quantifies how well a sample's demographic composition matches its target population. This paper introduces the Global Representativeness Index (GRI), a framework grounded in Total Variation Distance.
arXiv Detail & Related papers (2026-02-16T15:26:52Z)
Are Aligned Large Language Models Still Misaligned? [13.062124372682106]
Mis-Align Bench is a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. SAVACU is an English-aligned dataset of 382,424 misaligned samples spanning 112 domains (or labels).
arXiv Detail & Related papers (2026-02-11T19:30:43Z)
Regional Bias in Large Language Models [0.0]
Regional bias in large language models (LLMs) is an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs using prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions.
arXiv Detail & Related papers (2026-01-22T22:22:23Z)
Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs [13.630995219491972]
We propose a framework for evaluating the representativeness of aligned models. We show the value of our evaluation scheme by comparing two model steering techniques. We conclude that representativeness is a distinct aspect of value alignment.
arXiv Detail & Related papers (2026-01-22T08:45:55Z)
Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
Using Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that demographic context has little effect on free-text generation, and that the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z)
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4. The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other. We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench. GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)