Related papers: RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

URL: http://arxiv.org/abs/2509.25897v1
Date: Tue, 30 Sep 2025 07:42:49 GMT
Title: RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
Authors: Jisu Shin, Hoyun Song, Juhyun Oh, Changgeon Ko, Eunsu Kim, Chani Jung, Alice Oh
Abstract summary: We introduce RoleConflictBench, a novel benchmark designed to evaluate large language models' contextual sensitivity in complex social dilemmas. Our benchmark employs a three-stage pipeline to generate over 13K realistic role conflict scenarios across 65 roles. Our analysis quantifies these biases, revealing a dominant preference for roles within the Family and Occupation domains.
Score: 30.85143823239653
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Humans often encounter role conflicts -- social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) become increasingly influential in human decision-making, understanding how they behave in complex social situations is essential. While previous research has evaluated LLMs' social abilities in contexts with predefined correct answers, role conflicts represent inherently ambiguous social dilemmas that require contextual sensitivity: the ability to recognize and appropriately weigh situational cues that can fundamentally alter decision priorities. To address this gap, we introduce RoleConflictBench, a novel benchmark designed to evaluate LLMs' contextual sensitivity in complex social dilemmas. Our benchmark employs a three-stage pipeline to generate over 13K realistic role conflict scenarios across 65 roles, systematically varying their associated expectations (i.e., their responsibilities and obligations) and situational urgency levels. By analyzing model choices across 10 different LLMs, we find that while LLMs show some capacity to respond to these contextual cues, this sensitivity is insufficient. Instead, their decisions are predominantly governed by a powerful, inherent bias related to social roles rather than situational information. Our analysis quantifies these biases, revealing a dominant preference for roles within the Family and Occupation domains, as well as a clear prioritization of male roles and Abrahamic religions across most evaluatee models.

Related papers

Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution [7.599497643290519]
Large language models (LLMs) are increasingly used to simulate human behavior in social settings. It remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans.
arXiv Detail & Related papers (2026-02-07T07:20:24Z)
Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare [87.06241096619112]
Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. We introduce the Social Welfare Function Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator. We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation.
arXiv Detail & Related papers (2025-10-01T17:52:31Z)
How large language models judge and influence human cooperation [82.07571393247476]
We assess how state-of-the-art language models judge cooperative actions. We observe a remarkable agreement in evaluating cooperation against good opponents. We show that the differences revealed between models can significantly impact the prevalence of cooperation.
arXiv Detail & Related papers (2025-06-30T09:14:42Z)
Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games [87.5673042805229]
How large language models balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. We adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation.
arXiv Detail & Related papers (2025-06-29T15:02:47Z)
Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI). This paper explores potential areas where statisticians can make important contributions to the development of LLMs. We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z)
Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values [13.798198972161657]
A number of societal problems involve the distribution of resources, where fairness, along with economic efficiency, plays a critical role in the desirability of outcomes. This paper examines whether large language models (LLMs) adhere to fundamental fairness concepts and investigates their alignment with human preferences.
arXiv Detail & Related papers (2025-02-01T04:24:47Z)
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context [5.361970694197912]
This paper proposes a framework, grounded in behavioral economics, to evaluate the decision-making behaviors of large language models (LLMs). We estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities.
arXiv Detail & Related papers (2024-06-10T02:14:19Z)
SocialBench: Sociality Evaluation of Role-Playing Conversational Agents [85.6641890712617]
Large language models (LLMs) have advanced the development of various AI conversational agents. SocialBench is the first benchmark designed to evaluate the sociality of role-playing conversational agents at both individual and group levels. We find that excelling at the individual level does not imply proficiency at the group level.
arXiv Detail & Related papers (2024-03-20T15:38:36Z)
On the Decision-Making Abilities in Role-Playing using Large Language Models [6.550638804145713]
Large language models (LLMs) are increasingly utilized for role-playing tasks. This paper focuses on evaluating the decision-making abilities of LLMs post role-playing.
arXiv Detail & Related papers (2024-02-29T02:22:23Z)
Do LLM Agents Exhibit Social Behavior? [5.094340963261968]
State-Understanding-Value-Action (SUVA) is a framework to systematically analyze responses in social contexts. It assesses social behavior through both their final decisions and the response generation processes leading to those decisions. We demonstrate that utterance-based reasoning reliably predicts LLMs' final actions.
arXiv Detail & Related papers (2023-12-23T08:46:53Z)
Personality testing of Large Language Models: Limited temporal stability, but highlighted prosociality [0.0]
Large Language Models (LLMs) continue to gain popularity due to their human-like traits and the intimacy they offer to users. This study aimed to assess the temporal stability and inter-rater agreement of their responses on personality instruments at two time points. The findings revealed varying levels of inter-rater agreement in the LLMs' responses over a short time.
arXiv Detail & Related papers (2023-06-07T10:14:17Z)
Causal Fairness Analysis [68.12191782657437]
We introduce a framework for understanding, modeling, and possibly solving issues of fairness in decision-making settings. The main insight of our approach will be to link the quantification of the disparities present on the observed data with the underlying, and often unobserved, collection of causal mechanisms. Our effort culminates in the Fairness Map, which is the first systematic attempt to organize and explain the relationship between different criteria found in the literature.
arXiv Detail & Related papers (2022-07-23T01:06:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.