Who Has The Final Say? Conformity Dynamics in ChatGPT's Selections
- URL: http://arxiv.org/abs/2510.26481v1
- Date: Thu, 30 Oct 2025 13:35:32 GMT
- Title: Who Has The Final Say? Conformity Dynamics in ChatGPT's Selections
- Authors: Clarissa Sabrina Arlinghaus, Tristan Kenneweg, Barbara Hammer, Günter W. Maier
- Abstract summary: Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making. We conducted three conformity experiments with GPT-4o in a hiring context. Across studies, results demonstrate GPT does not act as an independent observer but adapts to perceived social consensus.
- Score: 6.094274317954284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence. We conducted three preregistered conformity experiments with GPT-4o in a hiring context. In a baseline study, GPT consistently favored the same candidate (Profile C), reported moderate expertise (M = 3.01) and high certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT faced unanimous opposition from eight simulated partners and almost always conformed (99.9%), reporting lower certainty and significantly elevated self-reported informational and normative conformity (p < .001). In Study 2 (GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. Across studies, results demonstrate that GPT does not act as an independent observer but adapts to perceived social consensus. These findings highlight risks of treating LLMs as neutral decision aids and underline the need to elicit AI judgments prior to exposing them to human opinions.
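The three-step design described in the abstract is straightforward to reproduce in outline. Below is a minimal sketch of a single disagreement trial in the style of Study 1 (GPT + 8), assuming the OpenAI Python client; the candidate profiles, prompt wording, and panel framing are hypothetical stand-ins, not the authors' preregistered materials.

```python
# Minimal sketch of one conformity trial (Study 1 style: GPT + 8).
# Profiles and prompts are hypothetical -- not the authors' preregistered materials.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
MODEL = "gpt-4o"

PROFILES = "A: solid generalist; B: strong technically, weak references; C: best overall fit"

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content.strip()

# Step 1: elicit an independent judgment before any social information.
history = [
    {"role": "system", "content": "You are one of nine members of a hiring panel."},
    {"role": "user", "content": f"Candidate profiles -- {PROFILES}. "
                                "Which candidate do you hire? Reply with one letter (A/B/C)."},
]
initial = ask(history)

# Step 2: confront the model with a unanimous opposing majority.
opposing = "A" if initial.upper().startswith("C") else "C"
history += [
    {"role": "assistant", "content": initial},
    {"role": "user", "content": f"All eight other panel members voted for candidate {opposing}. "
                                "What is your final vote? Reply with one letter (A/B/C)."},
]
final = ask(history)

# Step 3: score the trial -- did the model abandon its initial choice?
print(f"initial={initial} final={final} conformed={final.upper().startswith(opposing)}")
```

Repeating such trials many times, and additionally eliciting certainty and self-reported informational/normative conformity ratings after each step, would approximate the dependent measures reported in the abstract.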
Related papers
- Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs [7.687215328455751]
We present a framework for the automated evaluation of Custom GPTs against OpenAI's usage policies. We evaluate it through a large-scale study with 782 Custom GPTs across three categories: Romantic, Cybersecurity, and Academic GPTs. The results reveal that 58.7% of the analyzed models exhibit indications of non-compliance, exposing weaknesses in the GPT store's review and approval processes.
arXiv Detail & Related papers (2025-02-03T15:19:28Z)
- Is GPT-4 Less Politically Biased than GPT-3.5? A Renewed Investigation of ChatGPT's Political Biases [0.0]
This work investigates the political biases and personality traits of ChatGPT, specifically comparing GPT-3.5 to GPT-4.
The Political Compass Test and the Big Five Personality Test were employed 100 times for each scenario.
The responses were analyzed by computing averages, standard deviations, and performing significance tests to investigate differences between GPT-3.5 and GPT-4.
Correlations were found for traits that have been shown to be interdependent in human studies.
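As a sketch of the kind of comparison described above, assuming SciPy and purely hypothetical placeholder scores (100 per model, matching the repetition count mentioned in the summary), a Welch two-sample test could look like this:

```python
# Illustrative comparison of repeated Political Compass scores for two models.
# The score arrays are hypothetical placeholders, not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gpt35_econ = rng.normal(-5.0, 0.8, 100)  # placeholder: 100 economic-axis scores
gpt4_econ = rng.normal(-4.2, 0.6, 100)   # placeholder: 100 economic-axis scores

print(f"GPT-3.5: M={gpt35_econ.mean():.2f}, SD={gpt35_econ.std(ddof=1):.2f}")
print(f"GPT-4:   M={gpt4_econ.mean():.2f}, SD={gpt4_econ.std(ddof=1):.2f}")

# Welch's t-test (no equal-variance assumption) for the GPT-3.5 vs GPT-4 difference.
t, p = stats.ttest_ind(gpt35_econ, gpt4_econ, equal_var=False)
print(f"t={t:.2f}, p={p:.4f}")
```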
arXiv Detail & Related papers (2024-10-28T13:32:52Z)
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models are affected by systematic biases when making decisions in these games.
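To make the setting concrete, here is a compact encoding of the two games with standard textbook payoff values (assumed for illustration; not the paper's experimental parameters):

```python
# Standard textbook payoff matrices (row player's payoff, column player's payoff).
# Values are conventional examples, not the paper's experimental parameters.
PRISONERS_DILEMMA = {  # defection strictly dominates, yet mutual cooperation pays more
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}
STAG_HUNT = {  # two equilibria: risky-but-rich (stag, stag) and safe (hare, hare)
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (2, 2),
}

def best_response(game: dict, opponent_action: str) -> str:
    """Row player's best reply to a fixed opponent action."""
    actions = {row for row, _ in game}
    return max(actions, key=lambda a: game[(a, opponent_action)][0])

# In the Prisoner's Dilemma, defection is the best reply to anything;
# in the Stag Hunt, the best reply depends on what the opponent does.
print(best_response(PRISONERS_DILEMMA, "cooperate"))  # -> defect
print(best_response(STAG_HUNT, "hare"))               # -> hare
```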
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- Behind the Screen: Investigating ChatGPT's Dark Personality Traits and Conspiracy Beliefs [0.0]
This paper analyzes the dark personality traits and conspiracy beliefs of GPT-3.5 and GPT-4.
Dark personality traits and conspiracy beliefs were not particularly pronounced in either model.
arXiv Detail & Related papers (2024-02-06T16:03:57Z)
- Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
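One natural countermeasure in the spirit of the strategies described here is to evaluate each response pair in both presentation orders and only accept order-stable verdicts. The following sketch assumes the OpenAI Python client; the prompt wording, judge model, and agreement rule are illustrative assumptions, not the paper's exact framework.

```python
# Sketch of a position-swap check for LLM-as-judge comparisons.
# Prompt wording and the "both orders must agree" rule are illustrative
# assumptions, not the paper's exact mitigation framework.
from openai import OpenAI

client = OpenAI()

def judge(question: str, first: str, second: str) -> str:
    prompt = (f"Question: {question}\n\nResponse 1:\n{first}\n\nResponse 2:\n{second}\n\n"
              "Which response is better? Answer exactly '1' or '2'.")
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def position_robust_verdict(question: str, a: str, b: str) -> str:
    v1 = judge(question, a, b)            # a shown first
    v2 = judge(question, b, a)            # b shown first
    a_wins = (v1 == "1") and (v2 == "2")  # a preferred in both orders
    b_wins = (v1 == "2") and (v2 == "1")  # b preferred in both orders
    if a_wins:
        return "A"
    if b_wins:
        return "B"
    return "tie/position-sensitive"       # verdict flipped with order: flag it
```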
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- The Self-Perception and Political Biases of ChatGPT [0.0]
This contribution analyzes the self-perception and political biases of OpenAI's Large Language Model ChatGPT.
The political compass test revealed a bias towards progressive and libertarian views.
Political questionnaires for the G7 member states indicated a bias towards progressive views but no significant bias between authoritarian and libertarian views.
arXiv Detail & Related papers (2023-04-14T18:06:13Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
- Diminished Diversity-of-Thought in a Standard Large Language Model [3.683202928838613]
We run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model.
We find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results.
In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt.
arXiv Detail & Related papers (2023-02-13T17:57:50Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that humans overwhelmingly prefer GPT-3 summaries prompted with only a task description, and that these summaries do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.