Related papers: GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models

URL: http://arxiv.org/abs/2406.14903v2
Date: Mon, 24 Jun 2024 14:57:18 GMT
Title: GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models
Authors: Leyan Wang, Yonggang Jin, Tianhao Shen, Tianyu Zheng, Xinrun Du, Chenchen Zhang, Wenhao Huang, Jiaheng Liu, Shi Wang, Ge Zhang, Liuyu Xiang, Zhaofeng He
Abstract summary: We introduce GIEBench, a benchmark for empathy evaluation of large language models (LLMs). GIEBench includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives.
Score: 18.92131015111012
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: As large language models (LLMs) continue to develop and gain widespread application, the ability of LLMs to exhibit empathy towards diverse group identities and understand their perspectives is increasingly recognized as critical. Most existing benchmarks for empathy evaluation of LLMs focus primarily on universal human emotions, such as sadness and pain, often overlooking the context of individuals' group identities. To address this gap, we introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. This supports the ongoing development of empathetic LLM applications tailored to users with different identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives. This highlights the need for improved alignment of LLMs with diverse values to better accommodate the multifaceted nature of human identities. Our datasets are available at https://github.com/GIEBench/GIEBench.

Related papers

Us-vs-Them bias in Large Language Models [0.569978892646475]
We find consistent ingroup-positive and outgroup-negative associations across foundational large language models. For personas examined, conservative personas exhibit greater outgroup hostility, whereas liberal personas display stronger ingroup solidarity.
arXiv Detail & Related papers (2025-12-03T07:11:22Z)
Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model's Empathy [1.6489674562395387]
We investigate how Large Language Models' cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender. We show that they broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture.
arXiv Detail & Related papers (2025-10-11T20:04:57Z)
HebID: Detecting Social Identities in Hebrew-language Political Text [1.435381256004719]
We introduce HebID, the first multilabel Hebrew corpus for social identity detection. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities.
arXiv Detail & Related papers (2025-08-21T12:01:56Z)
Language Models Change Facts Based on the Way You Talk [38.44076602344941]
We find that large language models (LLMs) are extremely sensitive to markers of identity in user queries. These biases mean that the use of off-the-shelf LLMs for these applications may cause harmful differences in medical care, foster wage gaps, and create different political factual realities for people of different identities.
arXiv Detail & Related papers (2025-07-17T13:21:17Z)
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory [24.290880164707122]
We introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate large language models' decision-making. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences.
arXiv Detail & Related papers (2025-06-27T07:09:11Z)
SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation. We propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z)
Language Models Predict Empathy Gaps Between Social In-groups and Out-groups [36.16981127295606]
Studies of human psychology have demonstrated that people are more motivated to extend empathy to in-group members than out-group members. This study investigates how this aspect of intergroup relations in humans is replicated by LLMs in an emotion intensity prediction task.
arXiv Detail & Related papers (2025-03-02T21:31:14Z)
Socio-Culturally Aware Evaluation Framework for LLM-Based Content Moderation [10.724258809442958]
We propose a socio-culturally aware evaluation framework for content moderation. We introduce a scalable method for creating diverse datasets using persona-based generation.
arXiv Detail & Related papers (2024-12-18T07:57:18Z)
I'm Spartacus, No, I'm Spartacus: Measuring and Understanding LLM Identity Confusion [33.80335805399509]
Large Language Models (LLMs) excel in diverse tasks such as text generation, data analysis, and software development. However, the rapid proliferation of LLMs has raised concerns about their originality and trustworthiness. This study systematically examines identity confusion through three research questions.
arXiv Detail & Related papers (2024-11-16T03:20:39Z)
Large Language Models Reflect the Ideology of their Creators [73.25935570218375]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. We uncover notable diversity in the ideological stance exhibited across different LLMs and languages.
arXiv Detail & Related papers (2024-10-24T04:02:30Z)
Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear. By including additional context in prompts, we analyze LLM's sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z)
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model [52.697180472760635]
This paper explores the potential of character identities memory and recognition across multiple visual scenarios. We propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions.
arXiv Detail & Related papers (2024-07-10T12:11:59Z)
Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. This paper presents a framework for investigating psychology dimension in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation. We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
ToMBench: Benchmarking Theory of Mind in Large Language Models [42.80231362967291]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others. Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination. We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z)
Large language models should not replace human participants because they can misportray and flatten identity groups [36.36009232890876]
We show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are likely to both misportray and flatten the representations of demographic groups. We also discuss a third limitation about how identity prompts can essentialize identities.
arXiv Detail & Related papers (2024-02-02T21:21:06Z)
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench [20.909504977779978]
We introduce AwareBench, a benchmark designed to evaluate awareness in large language models (LLMs). We categorize awareness in LLMs into five dimensions, including capability, mission, emotion, culture, and perspective. Our experiments, conducted on 13 LLMs, reveal that the majority of them struggle to fully recognize their capabilities and missions while demonstrating decent social intelligence.
arXiv Detail & Related papers (2024-01-31T14:41:23Z)
Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs [13.744746481528711]
Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts. We evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness. We find that in zero-shot settings, most models' predictions for both tasks align more closely with labels from White participants than those from Asian or Black participants.
arXiv Detail & Related papers (2023-11-16T10:02:24Z)
On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented. Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)
Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
Large Language Models (LLMs) have made it crucial to align their values with those of humans. We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.