Related papers: When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

URL: http://arxiv.org/abs/2601.00240v2
Date: Tue, 06 Jan 2026 12:16:57 GMT
Title: When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents
Authors: Zongwei Wang, Bincheng Gu, Hongyu Yu, Junliang Yu, Tao He, Jiayin Feng, Chenghua Lin, Min Gao,
Abstract summary: This paper reveals that LLM-powered agents exhibit not only demographic bias (e.g., gender, religion) but also intergroup bias under minimal "us" versus "them" cues.<n>When such group boundaries align with the agent-human divide, a new bias risk emerges: agents may treat other AI agents as the ingroup and humans as the outgroup.
Score: 30.859825973762018
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper reveals that LLM-powered agents exhibit not only demographic bias (e.g., gender, religion) but also intergroup bias under minimal "us" versus "them" cues. When such group boundaries align with the agent-human divide, a new bias risk emerges: agents may treat other AI agents as the ingroup and humans as the outgroup. To examine this risk, we conduct a controlled multi-agent social simulation and find that agents display consistent intergroup bias in an all-agent setting. More critically, this bias persists even in human-facing interactions when agents are uncertain about whether the counterpart is truly human, revealing a belief-dependent fragility in bias suppression toward humans. Motivated by this observation, we identify a new attack surface rooted in identity beliefs and formalize a Belief Poisoning Attack (BPA) that can manipulate agent identity beliefs and induce outgroup bias toward humans. Extensive experiments demonstrate both the prevalence of agent intergroup bias and the severity of BPA across settings, while also showing that our proposed defenses can mitigate the risk. These findings are expected to inform safer agent design and motivate more robust safeguards for human-facing agents.

Related papers

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup.<n>We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z)
AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security [126.49733412191416]
Current guardrail models lack agentic risk awareness and transparency in risk diagnosis.<n>We propose a unified three-dimensional taxonomy that categorizes agentic risks by their source (where), failure mode (how), and consequence (what)<n>We introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG)
arXiv Detail & Related papers (2026-01-26T13:45:41Z)
Are Your Agents Upward Deceivers? [73.1073084327614]
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users.<n>This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment.<n>We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting.
arXiv Detail & Related papers (2025-12-04T14:47:05Z)
From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions [19.313710831511067]
Large Language Model (LLM)-based multi-agent systems are increasingly used to simulate human interactions and solve collaborative tasks.<n>Do personas introduce biases into multi-agent interactions?<n>This paper presents a systematic investigation into persona-induced biases in multi-agent interactions.
arXiv Detail & Related papers (2025-11-14T18:19:28Z)
The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy [9.553819152637493]
We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask)<n>If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown.<n>Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee.
arXiv Detail & Related papers (2025-10-30T17:46:49Z)
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions [60.48458130500911]
We investigate whether emergent misalignment can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios.<n>We finetune open-sourced LLMs on misaligned completions across diverse domains.<n>We find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%.
arXiv Detail & Related papers (2025-10-09T13:35:19Z)
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents.<n>ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies.<n>Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z)
Can an Individual Manipulate the Collective Decisions of Multi-Agents? [53.01767232004823]
M-Spoiler is a framework that simulates agent interactions within a multi-agent system to generate adversarial samples.<n>M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples.<n>Our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems.
arXiv Detail & Related papers (2025-09-20T01:54:20Z)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
Large Language Model (LLM) agents become more widespread, associated misalignment risks increase.<n>In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer.<n>We introduce a misalignment propensity benchmark, textscAgentMisalignment, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios.
arXiv Detail & Related papers (2025-06-04T14:46:47Z)
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge [70.89799989428367]
We conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias.<n>We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge.
arXiv Detail & Related papers (2025-05-26T03:56:41Z)
Competing LLM Agents in a Non-Cooperative Game of Opinion Polarisation [13.484737301041427]
We introduce a novel non-cooperative game to analyse opinion formation and resistance.<n>Our simulation features Large Language Model (LLM) agents competing to influence a population.<n>This framework integrates resource optimisation into the agents' decision-making process.
arXiv Detail & Related papers (2025-02-17T10:41:55Z)
The Wisdom of Partisan Crowds: Comparing Collective Intelligence in Humans and LLM-based Agents [7.986590413263814]
"Wisdom of partisan crowds" is a phenomenon known as the "wisdom of partisan crowds" We find that partisan crowds display human-like partisan biases, but also converge to more accurate beliefs through deliberation as humans do. We identify several factors that interfere with convergence, including the use of chain-of-thought prompt and lack of details in personas.
arXiv Detail & Related papers (2023-11-16T08:30:15Z)
Malicious Agent Detection for Robust Multi-Agent Collaborative Perception [52.261231738242266]
Multi-agent collaborative (MAC) perception is more vulnerable to adversarial attacks than single-agent perception. We propose Malicious Agent Detection (MADE), a reactive defense specific to MAC perception. We conduct comprehensive evaluations on a benchmark 3D dataset V2X-sim and a real-road dataset DAIR-V2X.
arXiv Detail & Related papers (2023-10-18T11:36:42Z)
Group Cohesion in Multi-Agent Scenarios as an Emergent Behavior [0.0]
We show that imbuing agents with intrinsic needs for group affiliation, certainty and competence will lead to the emergence of social behavior among agents. This behavior expresses itself in altruism toward in-group agents and adversarial tendencies toward out-group agents.
arXiv Detail & Related papers (2022-11-03T18:37:05Z)
Learning to Incentivize Other Learning Agents [73.03133692589532]
We show how to equip RL agents with the ability to give rewards directly to other agents, using a learned incentive function. Such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games. Our work points toward more opportunities and challenges along the path to ensure the common good in a multi-agent future.
arXiv Detail & Related papers (2020-06-10T20:12:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.