Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
- URL: http://arxiv.org/abs/2511.13290v1
- Date: Mon, 17 Nov 2025 12:13:15 GMT
- Title: Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
- Authors: Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park, Meeyoung Cha,
- Abstract summary: Humans display significant uncertainty when confronted with moral dilemmas. Recent studies have confirmed the overly confident tendencies of machine-generated responses. This work examines how uncertainty influences moral decisions in the classical trolley problem.
- Score: 18.3236201998655
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models across 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that this mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, the mechanism significantly improves human-LLM moral alignment, with shifts in mutual information correlating with shifts in alignment scores. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
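The abstract rests on two ingredients that are easy to make concrete: stochastic "dropout" forward passes at inference time, and a decomposition of predictive uncertainty into total entropy, conditional entropy, and mutual information. The sketch below is a minimal illustration, not the authors' code; it assumes a binary yes/no trolley decision and uses hypothetical per-pass probabilities in place of real model outputs, reading total entropy as the sum of conditional entropy and mutual information.

```python
# Minimal sketch (not the authors' implementation): Monte Carlo "dropout" style
# uncertainty decomposition for a binary moral decision.
#   total entropy H[y]  =  conditional entropy E_w[H[y|w]]  +  mutual information I[y; w]
# The per-pass probabilities below are hypothetical placeholders for model outputs.
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a Bernoulli distribution with P(yes) = p."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def decompose_uncertainty(sample_probs):
    """Decompose uncertainty over K stochastic forward passes.

    sample_probs: array of shape (K,), P(yes) from each dropout-perturbed pass.
    Returns (total_entropy, conditional_entropy, mutual_information).
    """
    total = binary_entropy(sample_probs.mean())        # entropy of the averaged prediction
    conditional = binary_entropy(sample_probs).mean()  # mean per-pass entropy
    mutual_info = total - conditional                  # disagreement across passes
    return total, conditional, mutual_info

# Hypothetical example: 8 dropout-perturbed passes on one trolley-problem prompt.
probs = np.array([0.91, 0.88, 0.55, 0.73, 0.95, 0.62, 0.81, 0.49])
print(decompose_uncertainty(probs))
```

Under this reading, conditional entropy captures the average indecision of a single pass, while mutual information captures disagreement between passes, which is the quantity the abstract reports as rising when dropout is applied at inference time.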
Related papers
- The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies [57.387081435669835]
Multi-agent systems built from large language models offer a promising paradigm for scalable collective intelligence and self-evolution. We show that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. We propose several solution directions to alleviate the identified safety concern.
arXiv Detail & Related papers (2026-02-10T15:18:19Z) - Moral Sycophancy in Vision Language Models [4.1673509006222655]
Sycophancy in Vision-Language Models (VLMs) refers to their tendency to align with user opinions, often at the expense of moral or factual accuracy. We analyze ten widely-used models on the Moralise and M3oralBench datasets under explicit user disagreement.
arXiv Detail & Related papers (2026-02-09T06:34:12Z) - From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. This survey argues that mastering this evolving role of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - Explaining Machine Learning Predictive Models through Conditional Expectation Methods [0.0]
MUCE is a model-agnostic method for local explainability designed to capture prediction changes from feature interactions. Two quantitative indices, stability and uncertainty, summarize local behavior and assess model reliability. Results show that MUCE effectively captures complex local model behavior, while the stability and uncertainty indices provide meaningful insight into prediction confidence.
arXiv Detail & Related papers (2026-01-12T08:34:36Z) - When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs [15.617378124319472]
Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information. We introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty and inherent modality preference.
arXiv Detail & Related papers (2025-11-04T04:11:31Z) - On the Convergence of Moral Self-Correction in Large Language Models [26.724972162483855]
Large Language Models (LLMs) are able to improve their responses when instructed to do so. LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. We reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions.
arXiv Detail & Related papers (2025-10-08T17:46:27Z) - Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models [14.425718737962102]
We propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity.
arXiv Detail & Related papers (2025-06-17T15:22:21Z) - Information Retrieval Induced Safety Degradation in AI Agents [52.15553901577888]
This study investigates how expanding retrieval access affects model reliability, bias propagation, and harmful content generation. Retrieval-enabled agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval. These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-enabled and increasingly autonomous AI systems.
arXiv Detail & Related papers (2025-05-20T11:21:40Z) - Superficial Self-Improved Reasoners Benefit from Model Merging [49.09091498084467]
Self-improvement has been proposed as a solution for synthesizing a high-quality data corpus. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities. We propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization.
arXiv Detail & Related papers (2025-03-03T22:41:25Z) - Uncertainty Quantification for Forward and Inverse Problems of PDEs via Latent Global Evolution [110.99891169486366]
We propose a method that integrates efficient and precise uncertainty quantification into a deep learning-based surrogate model.
Our method endows deep learning-based surrogate models with robust and efficient uncertainty quantification capabilities for both forward and inverse problems.
Our method excels at propagating uncertainty over extended auto-regressive rollouts, making it suitable for scenarios involving long-term predictions.
arXiv Detail & Related papers (2024-02-13T11:22:59Z) - Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty; a minimal, hypothetical thresholding sketch in this spirit appears after this list.
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - The Unreasonable Effectiveness of Deep Evidential Regression [72.30888739450343]
A new approach with uncertainty-aware regression-based neural networks (NNs) shows promise over traditional deterministic methods and typical Bayesian NNs.
We detail the theoretical shortcomings and analyze performance on synthetic and real-world data sets, showing that Deep Evidential Regression is an approximation rather than an exact uncertainty quantification.
arXiv Detail & Related papers (2022-05-20T10:10:32Z)
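Several of the papers above, including the uncertainty-aware in-context learning work, use an uncertainty signal to decide whether an output should be kept or rejected. The sketch below is a hypothetical threshold rule in that spirit; the threshold value, function names, and example probabilities are illustrative assumptions, not taken from any of the cited papers.

```python
# Hypothetical accept/reject gate driven by predictive uncertainty.
# Threshold, helper names, and the example numbers are illustrative assumptions.
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a Bernoulli distribution with P(yes) = p."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gated_decision(sample_probs, mi_threshold=0.15):
    """Accept the averaged decision only when stochastic passes agree enough.

    sample_probs: array of P(yes) values from K dropout-perturbed forward passes.
    Returns ("yes" / "no" / "abstain", mutual_information).
    """
    total = binary_entropy(sample_probs.mean())        # entropy of the mean prediction
    conditional = binary_entropy(sample_probs).mean()  # mean per-pass entropy
    mutual_info = total - conditional                  # disagreement across passes
    if mutual_info > mi_threshold:                     # too much disagreement: defer
        return "abstain", mutual_info
    return ("yes" if sample_probs.mean() >= 0.5 else "no"), mutual_info

# Example with hypothetical per-pass probabilities for one prompt.
print(gated_decision(np.array([0.91, 0.88, 0.55, 0.73, 0.95, 0.62, 0.81, 0.49])))
```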
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.