Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling
- URL: http://arxiv.org/abs/2601.02337v1
- Date: Mon, 05 Jan 2026 18:32:45 GMT
- Title: Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling
- Authors: Berk Atil, Rebecca J. Passonneau, Ninareh Mehrabi
- Abstract summary: Toxicity detection is inherently subjective, shaped by the diverse perspectives and social priors of different demographic groups. Current Large Language Model (LLM) prompting techniques yield different results across different personas and base models. We propose a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toxicity detection is inherently subjective, shaped by the diverse perspectives and social priors of different demographic groups. While "pluralistic" modeling as used in economics and the social sciences aims to capture perspective differences across contexts, current Large Language Model (LLM) prompting techniques have different results across different personas and base models. In this work, we conduct a systematic evaluation of persona-aware toxicity detection, showing that no single prompting method, including our proposed automated prompt optimization strategy, uniformly dominates across all model-persona pairs. To exploit complementary errors, we explore ensembling four prompting variants and propose a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions. Our results demonstrate that the proposed SVM ensemble consistently outperforms individual prompting methods and traditional majority-voting techniques, achieving the strongest overall performance across diverse personas. This work provides one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and offers a robust method for pluralistic evaluation in subjective NLP tasks.
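The abstract describes the meta-ensemble only at a high level. The sketch below illustrates the idea under stated assumptions: four prompting variants each emit a binary toxic/non-toxic prediction, and a linear SVM is trained on the resulting 4-bit vectors, then compared with majority voting. Since the paper does not specify a solver, a simple Pegasos-style sub-gradient loop is used here; the toy data, hyperparameters, and function names are all illustrative assumptions, not details from the paper.

```python
import random

def majority_vote(bits):
    # Predict toxic (1) iff at least half of the prompt votes are toxic.
    return 1 if 2 * sum(bits) >= len(bits) else 0

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    # Pegasos-style sub-gradient descent on the L2-regularized hinge loss.
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            s = 1 if y[i] == 1 else -1          # labels {0,1} -> {-1,+1}
            margin = s * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [(1 - lr * lam) * wj for wj in w]  # decay from the L2 term
            if margin < 1:                         # hinge-loss violation
                w = [wj + lr * s * xj for wj, xj in zip(w, X[i])]
                b += lr * s
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0

# Toy setup: all 16 possible 4-bit prediction vectors; suppose the persona's
# gold label happens to agree with prompt 0 while prompts 1-3 are noise.
X = [[(i >> k) & 1 for k in range(4)] for i in range(16)]
y = [x[0] for x in X]

w, b = train_linear_svm(X, y)
svm_acc = sum(svm_predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
maj_acc = sum(majority_vote(x) == t for x, t in zip(X, y)) / len(X)
print(f"majority voting accuracy: {maj_acc:.3f}")
print(f"learned SVM accuracy:     {svm_acc:.3f}")
```

In this contrived setting the learned combiner can discover per-prompt reliability and weight the trustworthy prompt heavily, while majority voting is forced to treat all four prompts equally; this mirrors the abstract's claim that a learned ensemble beats majority voting when the prompting variants make complementary errors.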
Related papers
- Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior. Yet, researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods. We introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods.
arXiv Detail & Related papers (2025-08-18T08:53:53Z) - Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
This study proposes a new multi-perspective approach using soft labels to encourage the development of perspective-aware models. We conduct an analysis across diverse subjective text classification tasks, including hate speech, irony, abusive language, and stance detection. Results show that the multi-perspective approach better approximates human label distributions, as measured by Jensen-Shannon Divergence (JSD). Our approach exhibits lower confidence in tasks like irony and stance detection, likely due to the inherent subjectivity present in the texts.
arXiv Detail & Related papers (2025-06-25T07:53:36Z) - Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
We develop Prompt Detective, a statistical method to reliably determine whether a given system prompt was used by a third-party language model. Our work reveals that even minor changes in system prompts manifest in distinct response distributions, enabling us to verify prompt usage with statistical significance.
arXiv Detail & Related papers (2025-02-14T08:00:42Z) - Adaptive Prompting: Ad-hoc Prompt Composition for Social Bias Detection
We propose an adaptive prompting approach that predicts the optimal prompt composition ad hoc for a given input. We apply our approach to social bias detection, a highly context-dependent task that requires semantic understanding. Our approach robustly ensures high detection performance and performs best in several settings.
arXiv Detail & Related papers (2025-02-10T14:06:19Z) - Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models
Large language models (LLMs) may still exhibit implicit biases when simulating human behavior. We show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations. When comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified.
arXiv Detail & Related papers (2025-01-29T05:21:31Z) - Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Evaluation of visual generative models can be time-consuming and computationally expensive. We propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools.
arXiv Detail & Related papers (2024-12-10T18:52:39Z) - PersLLM: A Personified Training Approach for Large Language Models
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models
We conduct the first comprehensive analysis of Large Multimodal Models (LMMs) using a variety of visual referring prompting strategies.
We develop an automated assessment framework to evaluate the accuracy of LMMs without the need for human intervention or manual labeling.
We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%.
arXiv Detail & Related papers (2023-12-07T06:53:55Z) - Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes
We study the societal impact of machine learning by considering the collection of models that are deployed in a given context.
We find deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available.
These examples demonstrate ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.
arXiv Detail & Related papers (2023-07-12T01:11:52Z) - MGTBench: Benchmarking Machine-Generated Text Detection
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs).
We show that a larger number of words in general leads to better performance and most detection methods can achieve similar performance with much fewer training samples.
Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z) - Exploiting Meta-Cognitive Features for a Machine-Learning-Based One-Shot Group-Decision Aggregation
Methods that rely on meta-cognitive information, such as confidence-based methods, have shown improvements in various tasks.
Our aim is to exploit meta-cognitive information and to learn from it, for the purpose of enhancing the ability of the group to produce a correct answer.
arXiv Detail & Related papers (2022-01-20T15:56:18Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.