STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
- URL: http://arxiv.org/abs/2409.13843v1
- Date: Fri, 20 Sep 2024 18:34:38 GMT
- Title: STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
- Authors: Robert Morabito, Sangmitra Madhusudan, Tyler McDonald, Ali Emami,
- Abstract summary: Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.
We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.
Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
- Score: 6.19084217044276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitates the creation of fairer language models.
Related papers
- Large Language Models Still Exhibit Bias in Long Text [14.338308312117901]
We introduce the Long Text Fairness Test (LTF-TEST), a framework that evaluates biases in large language models.
By assessing both model responses and the reasoning behind them, LTF-TEST uncovers subtle biases that are difficult to detect in simple responses.
We propose FT-REGARD, a finetuning approach that pairs biased prompts with neutral responses.
arXiv Detail & Related papers (2024-10-23T02:51:33Z) - A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z) - Supporting Human Raters with the Detection of Harmful Content using Large Language Models [8.580258386804282]
We demonstrate that large language models (LLMs) can achieve 90% accuracy when compared to human verdicts.
We propose five design patterns that integrate LLMs with human rating.
We share how piloting our proposed techniques in a real-world review queue yielded a 41.5% improvement in optimizing available human rater capacity.
arXiv Detail & Related papers (2024-06-18T17:12:50Z) - Dissecting vocabulary biases datasets through statistical testing and
automated data augmentation for artifact mitigation in Natural Language
Inference [3.154631846975021]
We focus on investigating dataset artifacts and developing strategies to address these issues.
We propose several automatic data augmentation strategies spanning character to word levels.
Experiments demonstrate that the proposed approaches effectively enhance model accuracy and reduce biases by up to 0.66% and 1.14%, respectively.
arXiv Detail & Related papers (2023-12-14T08:46:26Z) - Locally Differentially Private Document Generation Using Zero Shot
Prompting [61.20953109732442]
We propose a locally differentially private mechanism called DP-Prompt to counter author de-anonymization attacks.
When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks.
arXiv Detail & Related papers (2023-10-24T18:25:13Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI
Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs)
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.