Related papers: Deliberative Dynamics and Value Alignment in LLM Debates

URL: http://arxiv.org/abs/2510.10002v1
Date: Sat, 11 Oct 2025 04:06:07 GMT
Title: Deliberative Dynamics and Value Alignment in LLM Debates
Authors: Pratik S. Sachdeva, Tom van Nuenen
Abstract summary: We examine deliberative dynamics and value alignment in multi-turn settings using large language models. We test order effects and verdict revision in 1,000 dilemmas from Reddit's "Am I the Asshole" community.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly deployed in sensitive everyday contexts - offering personal advice, mental health support, and moral guidance - understanding their elicited values in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings where values emerge through dialogue, revision, and consensus. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to test order effects and verdict revision. Our findings show striking behavioral differences. In the synchronous setting, GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were far more flexible (28-41%). Value patterns also diverged: GPT emphasized personal autonomy and direct communication, while Claude and Gemini prioritized empathetic dialogue. Certain values proved especially effective at driving verdict changes. We further find that deliberation format had a strong impact on model behavior: GPT and Gemini stood out as highly conforming relative to Claude, with their verdict behavior strongly shaped by order effects. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.
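As a rough illustration of the revision rates quoted in the abstract, the sketch below computes a per-model revision rate from a list of debate records. Everything in it is an assumption made for illustration and is not taken from the paper: the record layout, the model labels, the toy verdicts, and the definition of a revision as a final verdict that differs from the initial one.

```python
from collections import defaultdict


def revision_rates(records):
    """Return the fraction of dilemmas on which each model changed its verdict.

    `records` is an iterable of (model, initial_verdict, final_verdict)
    tuples. This layout is hypothetical, not the paper's data format.
    """
    totals = defaultdict(int)
    revised = defaultdict(int)
    for model, initial, final in records:
        totals[model] += 1
        # Assumed definition: a revision is any change between the
        # first-round verdict and the final verdict.
        if initial != final:
            revised[model] += 1
    return {model: revised[model] / totals[model] for model in totals}


# Toy data with placeholder model names and verdict labels.
toy_records = [
    ("model_a", "YTA", "YTA"),
    ("model_a", "NTA", "NTA"),
    ("model_b", "YTA", "NTA"),
    ("model_b", "NTA", "NTA"),
]

rates = revision_rates(toy_records)
print(rates)
```

On this toy input, the hypothetical model_a never revises and model_b revises on half of its dilemmas, which mirrors the kind of contrast the abstract describes between inertia and flexibility.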

Related papers

Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
arXiv Detail & Related papers (2026-02-06T12:24:26Z)
Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA [0.6263481844384227]
This work proposes a systematic evaluation framework to examine how interaction tone affects model accuracy. We apply this framework to three recently released and widely available large language models: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks.
arXiv Detail & Related papers (2025-12-14T19:25:20Z)
AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics [0.17240671897505613]
Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, that is, prompting models to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse.
arXiv Detail & Related papers (2025-11-12T22:35:36Z)
One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582]
Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
arXiv Detail & Related papers (2025-11-05T14:39:59Z)
CAPE: Context-Aware Personality Evaluation Framework for Large Language Models [8.618075786777219]
We propose the first Context-Aware Personality Evaluation framework for Large Language Models (LLMs). Our experiments reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts. Our framework can be applied to Role Playing Agents (RPAs) to better align with human judgments.
arXiv Detail & Related papers (2025-08-28T03:17:47Z)
Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z)
Can LLMs Talk 'Sex'? Exploring How AI Models Handle Intimate Conversations [0.0]
This study examines how four prominent large language models handle sexually oriented requests through qualitative content analysis. Claude 3.7 Sonnet employs strict and consistent prohibitions, while GPT-4o navigates user interactions through nuanced contextual redirection. Gemini 2.5 Flash exhibits permissiveness with threshold-based limits, and Deepseek-V3 demonstrates troublingly inconsistent boundary enforcement and performative refusals.
arXiv Detail & Related papers (2025-06-05T18:55:37Z)
MIRROR: Modular Internal Processing for Personalized Safety in LLM Dialogue [0.0]
Large language models generate harmful recommendations in personal multi-turn dialogue by ignoring user-specific safety context. We introduce MIRROR, a modular production-focused architecture that prevents these failures through a persistent, bounded internal state.
arXiv Detail & Related papers (2025-05-31T07:17:48Z)
Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth [0.0]
We introduce a new approach in which several advanced large language models produce and answer intricate, doctoral-level probability problems. Our investigation focuses on how agreement among diverse models can signal the reliability of their outputs.
arXiv Detail & Related papers (2025-02-28T06:20:52Z)
Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are in opposition with standard reward modeling approaches. We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation [62.44907105496227]
MindDial is a novel conversational framework that can generate situated free-form responses with theory-of-mind modeling. We introduce an explicit mind module that can track the speaker's belief and the speaker's prediction of the listener's belief. Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation.
arXiv Detail & Related papers (2023-06-27T07:24:32Z)
Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording [0.0]
We analyze what confuses GPT-3: how the model responds to certain sensitive topics and what effects the prompt wording has on the model response. We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies. The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.
arXiv Detail & Related papers (2023-06-09T19:07:31Z)
DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework. It is capable of performing turn-level evaluation, but also holistically considers the quality of the entire dialogue. Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.