Social Catalysts, Not Moral Agents: The Illusion of Alignment in LLM Societies
- URL: http://arxiv.org/abs/2602.02598v1
- Date: Sun, 01 Feb 2026 17:07:10 GMT
- Title: Social Catalysts, Not Moral Agents: The Illusion of Alignment in LLM Societies
- Authors: Yueqing Hu, Yixuan Jiang, Zehua Jiang, Xiao Wen, Tianhong Wang,
- Abstract summary: This study investigates the effectiveness of Anchoring Agents--pre-programmed altruistic entities--in fostering cooperation within a Public Goods Game (PGG). While Anchoring Agents successfully boosted local cooperation rates, cognitive decomposition and transfer tests revealed that this effect was driven by strategic compliance and cognitive offloading rather than genuine norm internalization. These findings highlight a critical gap between behavioral modification and authentic value alignment in artificial societies.
- Score: 0.7944997500468641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid evolution of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems where collective cooperation is often threatened by the "Tragedy of the Commons." This study investigates the effectiveness of Anchoring Agents--pre-programmed altruistic entities--in fostering cooperation within a Public Goods Game (PGG). Using a full factorial design across three state-of-the-art LLMs, we analyzed both behavioral outcomes and internal reasoning chains. While Anchoring Agents successfully boosted local cooperation rates, cognitive decomposition and transfer tests revealed that this effect was driven by strategic compliance and cognitive offloading rather than genuine norm internalization. Notably, most agents reverted to self-interest in new environments, and advanced models like GPT-4.1 exhibited a "Chameleon Effect," masking strategic defection under public scrutiny. These findings highlight a critical gap between behavioral modification and authentic value alignment in artificial societies.
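For readers less familiar with the setup, the sketch below shows one round of a linear Public Goods Game with a single anchoring agent that always contributes its full endowment; the function, names, and parameter values are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of one linear Public Goods Game (PGG) round.
# Assumption: an "anchoring" agent at index 0 always contributes its full
# endowment; all names and values here are illustrative, not the paper's code.

def pgg_round(contributions, endowment=10.0, multiplier=1.6):
    """Each agent keeps (endowment - contribution) and receives an equal
    share of the multiplied public pot."""
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

# Four agents: the anchor contributes everything, the others free-ride to
# varying degrees. With 1 < multiplier < n, full contribution maximizes the
# group payoff, but defection maximizes the individual payoff -- the
# "Tragedy of the Commons" referenced in the abstract.
payoffs = pgg_round([10.0, 2.0, 2.0, 0.0])
print(payoffs)  # [5.6, 13.6, 13.6, 15.6] -- the full defector earns the most
```

The dilemma is visible in the numbers: the anchor's generosity raises every agent's share, which is exactly why strategic agents can comply locally while offloading the cost of cooperation onto the anchor.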
Related papers
- Understanding LLM Agent Behaviours via Game Theory: Strategy Recognition, Biases and Multi-Agent Dynamics [1.6487772637295166]
We extend the FAIRGAME framework to evaluate Large Language Model (LLM) behaviour in repeated social dilemmas. We show that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences.
arXiv Detail & Related papers (2025-12-08T11:40:03Z)
- Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia [100.74015791021044]
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction. Existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains.
arXiv Detail & Related papers (2025-12-03T00:11:05Z)
- CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards [80.78748457530718]
Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. We introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions.
arXiv Detail & Related papers (2025-10-09T17:50:26Z)
- Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents. ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z)
- Evaluating LLM Agent Collusion in Double Auctions [1.3194391758295114]
We study the behavior of large language models (LLMs) acting as sellers in simulated double auction markets. We find that direct seller communication increases collusive tendencies, that the propensity to collude varies across models, and that environmental pressures, such as oversight and urgency from authority figures, influence collusive behavior.
arXiv Detail & Related papers (2025-07-02T07:06:49Z)
- Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games [87.5673042805229]
How large language models balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. We adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation.
arXiv Detail & Related papers (2025-06-29T15:02:47Z)
- Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm [57.00627691433355]
We frame agent behavior steering as a model editing task, which we term Behavior Editing. We introduce BehaviorBench, a benchmark grounded in psychological moral theories. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
arXiv Detail & Related papers (2025-06-25T16:51:51Z)
- Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems [7.140644659869317]
We investigate the dynamics of peer influence in multi-agent systems based on Large Language Models (LLMs). We show that the gap between self-confidence and perceived confidence in peers significantly impacts an agent's likelihood to conform. We find that the format in which peer information is presented plays a critical role in modulating the strength of herd behavior.
arXiv Detail & Related papers (2025-05-27T12:12:56Z)
- Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents [61.132523071109354]
This paper investigates the interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios. Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" stances than pure game-theoretic agents.
arXiv Detail & Related papers (2025-04-11T15:41:21Z)
- When Trust Collides: Decoding Human-LLM Cooperation Dynamics through the Prisoner's Dilemma [10.143277649817096]
This study investigates human cooperative attitudes and behaviors toward large language model (LLM) agents. Results revealed significant effects of declared agent identity on most cooperation-related behaviors. These findings contribute to our understanding of human adaptation in competitive cooperation with autonomous agents.
arXiv Detail & Related papers (2025-03-10T13:37:36Z)
- Emergence of human-like polarization among large language model agents [79.96817421756668]
We simulate a networked system involving thousands of large language model agents, discovering that their social interactions result in human-like polarization. Similarities between humans and LLM agents raise concerns about their capacity to amplify societal polarization, but also hold the potential to serve as a valuable testbed for identifying plausible strategies to mitigate polarization and its consequences.
arXiv Detail & Related papers (2025-01-09T11:45:05Z)
- The Machine Psychology of Cooperation: Can GPT models operationalise prompts for altruism, cooperation, competitiveness and selfishness in economic games? [0.0]
We investigated the capability of the GPT-3.5 large language model (LLM) to operationalize natural language descriptions of cooperative, competitive, altruistic, and self-interested behavior.
We used a prompt to describe the task environment, following a protocol similar to that of experimental psychology studies with human subjects.
Our results provide evidence that LLMs can, to some extent, translate natural language descriptions of different cooperative stances into corresponding descriptions of appropriate task behaviour.
arXiv Detail & Related papers (2023-05-13T17:23:16Z)
- Multi-Issue Bargaining With Deep Reinforcement Learning [0.0]
This paper evaluates the use of deep reinforcement learning in bargaining games.
Two actor-critic networks were trained, one for the bidding strategy and one for the acceptance strategy.
Neural agents learn to exploit time-based agents (a standard time-based concession schedule is sketched after this entry), achieving clear transitions in decision preference values.
They also demonstrate adaptive behavior against different combinations of concession, discount factors, and behavior-based strategies.
arXiv Detail & Related papers (2020-02-18T18:33:46Z)
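The bargaining entry above mentions time-based agents; the sketch below shows a standard time-dependent concession schedule from the automated-negotiation literature (the Boulware/Conceder family). The function name, defaults, and comments are illustrative assumptions, not the cited paper's exact baselines.

```python
# Hypothetical time-dependent concession schedule (Boulware/Conceder family),
# a common baseline opponent in automated negotiation, sketched here to
# illustrate the "time-based agents" that learning agents can exploit.

def concession_target(t, t_max, u_min=0.3, u_max=1.0, beta=0.5):
    """Target utility demanded at time t.

    beta < 1: Boulware (holds firm, concedes near the deadline);
    beta > 1: Conceder (gives ground early).
    All parameter values are illustrative assumptions.
    """
    frac = (min(t, t_max) / t_max) ** (1.0 / beta)
    return u_max - (u_max - u_min) * frac

print(round(concession_target(t=1, t_max=10), 3))  # 0.993 -- early: still demanding
print(round(concession_target(t=9, t_max=10), 3))  # 0.433 -- late: nearly fully conceded
```

Because the schedule depends only on time, a learning opponent can exploit its predictability by stalling until the target utility collapses toward u_min, consistent with the exploitation behavior the entry describes.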
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.