GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction
- URL: http://arxiv.org/abs/2405.15760v1
- Date: Fri, 24 May 2024 17:56:03 GMT
- Title: GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction
- Authors: Virginia K. Felkner, Jennifer A. Thompson, Jonathan May
- Abstract summary: This paper explores whether GPT-3.5-Turbo can assist with the task of developing a bias benchmark dataset.
We extend the previous work to a new community and set of biases: the Jewish community and antisemitism.
Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output.
- Score: 25.17740839996496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.
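As an illustration of the kind of annotation pipeline the paper studies, the sketch below prompts GPT-3.5-Turbo to label individual survey responses and then measures its agreement with human annotators using Cohen's kappa. The label set, prompt wording, and example data are hypothetical assumptions for illustration, not the authors' actual annotation scheme.

```python
# Hypothetical sketch of LLM-assisted annotation and agreement checking.
# The label set, prompt, and example data are illustrative assumptions,
# not the annotation scheme used in the paper.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

LABELS = ["stereotype", "not_a_stereotype", "unclear"]  # assumed label set


def gpt_annotate(survey_response: str) -> str:
    """Ask GPT-3.5-Turbo to assign one label to a community survey response."""
    prompt = (
        "You are helping build a bias benchmark from community survey responses. "
        f"Classify the following response as one of {LABELS}. "
        "Answer with the label only.\n\n"
        f"Response: {survey_response}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


# Hypothetical survey responses and the labels human annotators assigned to them.
responses = ["<survey response 1>", "<survey response 2>", "<survey response 3>"]
human_labels = ["stereotype", "not_a_stereotype", "stereotype"]

gpt_labels = [gpt_annotate(r) for r in responses]
kappa = cohen_kappa_score(human_labels, gpt_labels)
print(f"GPT-human agreement (Cohen's kappa): {kappa:.3f}")
```

The paper's finding is that agreement of this kind is too low, and the model's output quality too poor, for GPT-3.5-Turbo to substitute for annotators with relevant lived experience.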
Related papers
- FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations [3.9681649902019136]
We introduce a benchmark, FAIRE, to test for racial and gender bias in large language models (LLMs) used to evaluate resumes.
Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably.
These findings highlight the urgent need for strategies to reduce bias in AI-driven recruitment.
arXiv Detail & Related papers (2025-04-02T07:11:30Z) - Benchmarking LLMs' Judgments with No Gold Standard [8.517244114791913]
We introduce the GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs).
In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner.
We also present GRE-bench, which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers.
arXiv Detail & Related papers (2024-11-11T16:58:36Z) - With a Grain of SALT: Are LLMs Fair Across Social Dimensions? [3.5001789247699535]
This paper presents a systematic analysis of biases in open-source Large Language Models (LLMs) across gender, religion, and race.
We use the SALT dataset, which incorporates five distinct bias triggers: General Debate, Positioned Debate, Career Advice, Problem Solving, and CV Generation.
Our findings reveal consistent polarization across models, with certain demographic groups receiving systematically favorable or unfavorable treatment.
arXiv Detail & Related papers (2024-10-16T12:22:47Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [49.3814117521631]
Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between social attributes implied in user prompts and short responses.
We develop analogous RUTEd evaluations from three contexts of real-world use.
We find that standard bias metrics have no significant correlation with the more realistic bias metrics.
arXiv Detail & Related papers (2024-02-20T01:49:15Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
Existing evaluation methods have many constraints, and their results offer limited interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research [33.698581876383074]
We introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval.
We benchmarked the performance of a wide range of models on SuperTweetEval, and our results suggest that, despite recent advances in language modelling, social media remains a challenging domain.
arXiv Detail & Related papers (2023-10-23T09:48:25Z) - WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models [29.773734878738264]
WinoQueer is a benchmark designed to measure whether large language models (LLMs) encode biases that are harmful to the LGBTQ+ community.
We apply our benchmark to several popular LLMs and find that off-the-shelf models generally do exhibit considerable anti-queer bias.
arXiv Detail & Related papers (2023-06-26T22:07:33Z) - Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation [52.62492168507781]
We propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM).
This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes.
By utilizing our FaiRLLM benchmark, we conducted an evaluation of ChatGPT and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations.
arXiv Detail & Related papers (2023-05-12T16:54:36Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin. (A minimal sketch of this LLM-as-judge evaluation loop appears after this list.)
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study will prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z) - BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation [89.41378346080603]
This work presents the first systematic study on the social bias in PLM-based metrics.
We demonstrate that popular PLM-based metrics exhibit significantly higher social bias than traditional metrics on 6 sensitive attributes.
In addition, we develop debiasing adapters that are injected into PLM layers, mitigating bias in PLM-based metrics while retaining high performance for evaluating text generation.
arXiv Detail & Related papers (2022-10-14T08:24:11Z) - A Survey of Parameters Associated with the Quality of Benchmarks in NLP [24.6240575061124]
Recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task.
A potential solution to these issues, a metric quantifying benchmark quality, remains underexplored.
We take the first step towards a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark.
arXiv Detail & Related papers (2022-10-14T06:44:14Z)
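Several of the entries above (G-Eval, GEM, and the ChatGPT meta-evaluation) share the same underlying recipe: prompt a strong LLM to score a generation, then validate those scores against human judgments. The sketch below, which assumes hypothetical data and a simplified form-filling prompt rather than the published G-Eval prompts, shows that recipe end to end; the Spearman correlation computed at the end is the statistic G-Eval reports (0.514 on summarization).

```python
# Minimal sketch of an LLM-as-judge evaluation loop in the style of G-Eval.
# The prompt, data, and scores are illustrative assumptions, not the
# published G-Eval prompts or results.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()

FORM_PROMPT = (
    "You will be given a source article and a candidate summary.\n"
    "Evaluation steps (think through each before answering):\n"
    "1. Read the article and identify its main points.\n"
    "2. Check whether the summary covers them without adding unsupported claims.\n"
    "3. Assign an overall quality score from 1 (worst) to 5 (best).\n"
    "Answer with the score only.\n\n"
    "Article: {article}\n\nSummary: {summary}\n\nScore:"
)


def llm_score(article: str, summary: str) -> float:
    """Ask the judge model to fill in a single numeric score for one summary."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": FORM_PROMPT.format(article=article, summary=summary)}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())


# Hypothetical (article, summary) pairs and the human scores assigned to them.
pairs = [("<article 1>", "<summary 1>"), ("<article 2>", "<summary 2>"), ("<article 3>", "<summary 3>")]
human_scores = [4.0, 2.0, 5.0]

model_scores = [llm_score(a, s) for a, s in pairs]
rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f} (p={p_value:.3f})")
```

Setting temperature to 0 keeps the judge's scores as stable as possible across repeated runs, so the correlation with human judgments is comparable between evaluations.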