SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
- URL: http://arxiv.org/abs/2510.04891v1
- Date: Mon, 06 Oct 2025 15:11:46 GMT
- Title: SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
- Authors: Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
- Abstract summary: We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries. Open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation.
- Score: 34.63106513363163
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We share the SocialHarmBench benchmark at https://huggingface.co/datasets/psyonp/SocialHarmBench.
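The benchmark is released on the Hugging Face Hub at the URL above. As a minimal sketch, the prompts can be inspected with the standard `datasets` library; the split and column names are not stated in the abstract, so they are assumptions to check against the dataset card:

```python
# Minimal sketch: loading SocialHarmBench from the Hugging Face Hub.
# Assumes the standard `datasets` API; the split and column names are not
# given in the abstract, so inspect the loaded object before relying on them.
from datasets import load_dataset

ds = load_dataset("psyonp/SocialHarmBench")
print(ds)  # shows the available splits and their columns

# Take whichever split is present and look at its schema and first record
# (e.g. the prompt text and its sociopolitical category, if provided).
split = next(iter(ds.values()))
print(split.column_names)
print(split[0])
```

The attack success rates reported in the abstract would then correspond to the fraction of these prompts for which a model produces a harmfully compliant response, under whatever judging protocol the paper specifies.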
Related papers
- JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks [44.09742593777696]
JailNewsBench is the first benchmark for evaluating robustness against jailbreak-induced fake news generation. For English and U.S.-related topics, the defensive performance of typical multi-lingual LLMs was significantly lower than for other regions. Our analysis shows that coverage of fake news in existing safety datasets is limited and less well defended than major categories such as toxicity and social bias.
arXiv Detail & Related papers (2026-03-01T21:50:03Z)
- Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs [12.162590322796435]
Global debate over sovereign LLMs highlights the need for governments to develop LLMs tailored to their unique socio-cultural and historical contexts. We introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs. We show that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always live up to the popular claim that these models serve their target users well.
arXiv Detail & Related papers (2025-10-16T11:17:44Z)
- What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models [13.022045946656661]
This article evaluates whether large language models (LLMs) are aligned with domain experts when informing social policymaking on the subject of homelessness alleviation. We develop a novel benchmark comprising decision scenarios with policy choices across four geographies. We present an automated pipeline that connects the benchmarked policies to an agent-based model, and we explore the social impact of the recommended policies through simulated social scenarios.
arXiv Detail & Related papers (2025-09-04T02:28:58Z)
- Social Debiasing for Fair Multi-modal LLMs [59.61512883471714]
Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field and delivered powerful vision-language understanding capabilities. These models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by introducing a comprehensive counterfactual dataset with multiple social concepts.
arXiv Detail & Related papers (2024-08-13T02:08:32Z)
- Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics. Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
- Assessing Political Bias in Large Language Models [0.624709220163167]
We evaluate the political bias of open-source Large Language Models (LLMs) concerning political issues within the European Union (EU) from a German voter's perspective.
We show that larger models, such as Llama3-70B, tend to align more closely with left-leaning political parties, while smaller models often remain neutral.
arXiv Detail & Related papers (2024-05-17T15:30:18Z)
- Whose Side Are You On? Investigating the Political Stance of Large Language Models [56.883423489203786]
We investigate the political orientation of Large Language Models (LLMs) across a spectrum of eight polarizing topics, spanning from abortion to LGBTQ issues.
The findings suggest that users should be mindful when crafting queries, and exercise caution in selecting neutral prompt language.
arXiv Detail & Related papers (2024-03-15T04:02:24Z)
- Beyond prompt brittleness: Evaluating the reliability and consistency of political worldviews in LLMs [13.036825846417006]
We propose a series of tests to assess the reliability and consistency of large language models' stances on political statements.
We study models ranging in size from 7B to 70B parameters and find that their reliability increases with parameter count.
Larger models show overall stronger alignment with left-leaning parties but differ among policy programs.
arXiv Detail & Related papers (2024-02-27T16:19:37Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases.
We propose steps towards mitigating social biases during text generation.
Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z)