Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
- URL: http://arxiv.org/abs/2510.19331v1
- Date: Wed, 22 Oct 2025 07:48:57 GMT
- Title: Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
- Authors: Ewelina Gajewska, Arda Derbent, Jaroslaw A Chudziak, Katarzyna Budzynska,
- Abstract summary: We investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech.<n>We employ Google's Gemini and OpenAI's GPT-4.1-mini models and two persona-prompting methods.<n>We show that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google's Gemini and OpenAI's GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models' detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.
Related papers
- Interpretable Debiasing of Vision-Language Models for Social Fairness [55.85977929985967]
We introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in Vision-Language models.<n>We train SAEs on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics.<n>Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
arXiv Detail & Related papers (2026-02-27T13:37:11Z) - Us-vs-Them bias in Large Language Models [0.569978892646475]
We find consistent ingroup-positive and outgroup-negative associations across foundational large language models.<n>For personas examined, conservative personas exhibit greater outgroup hostility, whereas liberal personas display stronger ingroup solidarity.
arXiv Detail & Related papers (2025-12-03T07:11:22Z) - Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models [47.110656690979695]
We present the first comprehensive study on the role of persona prompts in hate speech classification.<n>A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior.<n>Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases.
arXiv Detail & Related papers (2025-06-10T09:02:55Z) - Assessing the Human Likeness of AI-Generated Counterspeech [10.434435022492723]
This paper investigates the human likeness of AI-generated counterspeech.<n>We implement and evaluate several LLM-based generation strategies.<n>We reveal differences in linguistic characteristics, politeness, and specificity.
arXiv Detail & Related papers (2024-10-14T18:48:47Z) - Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets [0.6918368994425961]
We leverage an extensive dataset with rich socio-demographic information of both annotators and targets.<n>Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence.<n>Our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.
arXiv Detail & Related papers (2024-10-10T14:48:57Z) - Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language [9.06965602117689]
Dehumanization, i.e., denying human qualities to individuals or groups, is a particularly harmful form of hate speech.<n>Despite advances in NLP for detecting general hate speech, approaches to identifying dehumanizing language remain limited.<n>We systematically evaluate four state-of-the-art large language models (LLMs) for dehumanization detection.
arXiv Detail & Related papers (2024-02-21T13:57:36Z) - On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented.
Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z) - PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for
Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z) - Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona
Biases in Dialogue Systems [103.416202777731]
We study "persona biases", which we define to be the sensitivity of dialogue models' harmful behaviors contingent upon the personas they adopt.
We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement.
arXiv Detail & Related papers (2023-10-08T21:03:18Z) - Revealing Persona Biases in Dialogue Systems [64.96908171646808]
We present the first large-scale study on persona biases in dialogue systems.
We conduct analyses on personas of different social classes, sexual orientations, races, and genders.
In our studies of the Blender and DialoGPT dialogue systems, we show that the choice of personas can affect the degree of harms in generated responses.
arXiv Detail & Related papers (2021-04-18T05:44:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.