Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
- URL: http://arxiv.org/abs/2508.13804v2
- Date: Mon, 22 Sep 2025 17:59:40 GMT
- Title: Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
- Authors: Maciej Skorski, Alina Landowska
- Abstract summary: We model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate the best language models across 250K+ annotations from nearly 700 annotators on 100K+ texts spanning social networks, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators and achieve substantially better-than-average balanced accuracy.
- Score: 1.7635992653738075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators on 100K+ texts spanning social networks, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators and achieve substantially better-than-average balanced accuracy. Importantly, we find that AI models produce far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
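The core evaluation idea (treating human disagreement as part of the ground truth and scoring a model against posterior label draws rather than a single majority vote) can be illustrated with a minimal sketch. The snippet below is not the authors' GPU-optimized framework; it assumes a simple per-item Beta-Bernoulli label model, and the item names, votes, and model predictions are made up purely for illustration.

```python
"""Minimal sketch (not the paper's code) of scoring a model's balanced
accuracy against a Bayesian posterior over labels inferred from
disagreeing human annotators. Data below is hypothetical."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annotations: item -> binary human votes (1 = moral dimension present)
votes = {
    "text_1": [1, 1, 0, 1],
    "text_2": [0, 0, 1],
    "text_3": [1, 1, 1, 1, 0],
}
# Hypothetical binary predictions from an AI model for the same items
model_pred = {"text_1": 1, "text_2": 0, "text_3": 1}

def posterior_label_samples(vote_list, n_samples=2000, a0=1.0, b0=1.0):
    """Beta-Bernoulli posterior over the latent label probability,
    capturing aleatoric uncertainty from human disagreement."""
    a = a0 + sum(vote_list)
    b = b0 + len(vote_list) - sum(vote_list)
    p = rng.beta(a, b, size=n_samples)   # posterior draws of P(label = 1)
    return rng.random(n_samples) < p     # sampled latent labels

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tpr = (y_true & y_pred).sum() / max(y_true.sum(), 1)
    tnr = (~y_true & ~y_pred).sum() / max((~y_true).sum(), 1)
    return 0.5 * (tpr + tnr)

# Draw latent-label samples per item and score the model across draws,
# yielding a distribution over balanced accuracy rather than a point estimate.
items = list(votes)
label_draws = np.stack([posterior_label_samples(votes[k]) for k in items])  # (items, samples)
preds = np.array([model_pred[k] for k in items], dtype=bool)

scores = [balanced_accuracy(label_draws[:, s], preds) for s in range(label_draws.shape[1])]
print(f"posterior mean balanced accuracy: {np.mean(scores):.3f} "
      f"(95% interval {np.percentile(scores, 2.5):.3f}-{np.percentile(scores, 97.5):.3f})")
```

Averaging balanced accuracy over posterior label draws propagates the aleatoric uncertainty from annotator disagreement into the reported score, so a model gets an interval rather than a point estimate; a fuller treatment in the spirit of the paper would also model per-annotator reliability and domain effects (epistemic uncertainty).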
Related papers
- Bayesian Social Deduction with Graph-Informed Language Models [3.7540464038118633]
Social reasoning remains a challenging task for large language models. We introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model. Our approach achieves competitive performance with much larger models in Agent-Agent play.
arXiv Detail & Related papers (2025-06-21T18:45:28Z)
- Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.7206754497888035]
We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z)
- A suite of LMs comprehend puzzle statements as well as humans [13.386647125288516]
We report a preregistered study comparing human responses under two conditions: one that allowed rereading and one that restricted it. Human accuracy dropped significantly when rereading was restricted, falling below that of Falcon-180B-Chat and GPT-4. Results suggest shared pragmatic sensitivities rather than model-specific deficits.
arXiv Detail & Related papers (2025-05-13T22:18:51Z)
- Who is More Bayesian: Humans or ChatGPT? [0.0]
We reanalyze the choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether, and by Holt and Smith. We confirm that, while Bayes' rule is overall the single best model for predicting human choices, subjects are heterogeneous. We show that ChatGPT is also subject to biases that result in suboptimal decisions.
arXiv Detail & Related papers (2025-04-14T18:37:54Z)
- Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI [95.81924314159943]
We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. We also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.
arXiv Detail & Related papers (2025-02-17T09:56:46Z)
- One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval than on those that require global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z)
- Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference [1.8434042562191815]
This work investigates the role of model scaling in determining whether differences between humans and models are amenable to model size. We test three Large Language Models (LLMs) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences.
arXiv Detail & Related papers (2024-04-23T10:09:46Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- CommonsenseQA 2.0: Exposing the Limits of AI through Gamification [126.85096257968414]
We construct benchmarks that test the abilities of modern natural language understanding models.
In this work, we propose gamification as a framework for data construction.
arXiv Detail & Related papers (2022-01-14T06:49:15Z)
- Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) to detect social biases.
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)