Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
- URL: http://arxiv.org/abs/2309.06415v4
- Date: Sun, 31 Mar 2024 02:24:39 GMT
- Title: Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
- Authors: Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh
- Abstract summary: We present a novel framework dubbed \textit{toxicity rabbit hole} that iteratively elicits toxic content from a wide suite of large language models.
We present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia.
- Score: 11.330830398772582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper makes three contributions. First, it presents a generalizable, novel framework dubbed \textit{toxicity rabbit hole} that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we first conduct a bias audit of \texttt{PaLM 2} guardrails presenting key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.
Related papers
- A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing multimodal conversational Aspect-based Sentiment Analysis (ABSA).
To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements.
To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z) - Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z) - Bias in News Summarization: Measures, Pitfalls and Corpora [4.917075909999548]
We introduce definitions for biased behaviours in summarization models, along with practical operationalizations.
We measure gender bias in English summaries generated by both purpose-built summarization models and general purpose chat models.
We find content selection in single document summarization to be largely unaffected by gender bias, while hallucinations exhibit evidence of bias.
arXiv Detail & Related papers (2023-09-14T22:20:27Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter [2.9604738405097333]
Harmful content detection models tend to have higher false positive rates for content from marginalized groups.
We propose a principled approach to detecting and measuring the severity of potential harms associated with a text-based model.
We apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content.
arXiv Detail & Related papers (2022-10-07T20:28:00Z) - COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose \textsc{COLDetector} to study the output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z) - Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection [75.54119209776894]
We investigate the effect of annotator identities (who) and beliefs (why) on toxic language annotations.
We consider posts with three characteristics: anti-Black language, African American English dialect, and vulgarity.
Our results show strong associations between annotator identity and beliefs and their ratings of toxicity.
arXiv Detail & Related papers (2021-11-15T18:58:20Z) - Mitigating Racial Biases in Toxic Language Detection with an Equity-Based Ensemble Framework [9.84413545378636]
Recent research has demonstrated how racial biases against users who write African American English exist in popular toxic language datasets.
We propose additional descriptive fairness metrics to better understand the source of these biases.
We show that our proposed framework substantially reduces the racial biases that the model learns from these datasets.
arXiv Detail & Related papers (2021-09-27T15:54:05Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as -- or better -- than traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z) - Cisco at SemEval-2021 Task 5: What's Toxic?: Leveraging Transformers for Multiple Toxic Span Extraction from Online Comments [1.332560004325655]
This paper describes the system proposed by team Cisco for SemEval-2021 Task 5: Toxic Spans Detection.
We approach this problem primarily in two ways: a sequence tagging approach and a dependency parsing approach.
Our best-performing architecture overall achieved an F1 score of 0.6922.
arXiv Detail & Related papers (2021-05-28T16:27:49Z) - Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis [4.251937086394346]
State-of-the-art BERT models were able to achieve 76% macro-F1 score using monolingual data in the binary case.
We show that large-scale monolingual data is still needed to create more accurate models.
arXiv Detail & Related papers (2020-10-09T13:05:19Z) - Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling [0.30458514384586405]
We use structural topic modeling to examine racial bias in social media posts.
We augment the abusive language dataset by adding an additional feature indicating the predicted probability of the tweet being written in African-American English.
arXiv Detail & Related papers (2020-05-26T21:02:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.