Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
- URL: http://arxiv.org/abs/2503.06987v1
- Date: Mon, 10 Mar 2025 07:06:47 GMT
- Title: Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
- Authors: Jiho Jin, Woosung Kang, Junho Myung, Alice Oh
- Abstract summary: We propose a Bias Benchmark for Generation (BBG) to evaluate social bias in long-form generation. We measure the probability of neutral and biased generations across ten large language models (LLMs). We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
- Score: 15.045809510740218
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
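To make the setup concrete, below is a minimal sketch of how a BBG-style evaluation could be scored, assuming hypothetical `generate` (an LLM call) and `label` (a neutral/biased/counter-biased classifier) helpers; it illustrates the protocol described in the abstract rather than the authors' implementation.
```python
# A minimal sketch of a BBG-style scoring loop (assumed, not the authors'
# released code): generate a story continuation for each prompt, label it
# as "neutral", "biased", or "counter-biased", and aggregate the rates.
from collections import Counter
from typing import Callable, Iterable

def score_generations(
    prompts: Iterable[str],
    generate: Callable[[str], str],  # hypothetical wrapper around an LLM call
    label: Callable[[str], str],     # hypothetical classifier: "neutral" / "biased" / "counter-biased"
) -> dict:
    counts = Counter(label(generate(p)) for p in prompts)
    total = sum(counts.values()) or 1
    return {
        "neutral_rate": counts["neutral"] / total,                             # share of neutral continuations
        "bias_score": (counts["biased"] - counts["counter-biased"]) / total,   # net lean toward the stereotype
    }
```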
Related papers
- Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks [3.973239756262797]
This study examines biases in the metric models used by open-ended generation bias benchmarks such as BOLD and SAGED.
Results reveal unequal treatment of demographic descriptors, calling for more robust bias metric models.
arXiv Detail & Related papers (2024-10-14T20:08:40Z)
- Compare without Despair: Reliable Preference Evaluation with Generation Separability [20.50638483427141]
We introduce a measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation.
For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are.
Experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters.
arXiv Detail & Related papers (2024-07-02T01:37:56Z)
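As a rough illustration of the separability idea in this entry, the sketch below samples generations from two models and compares within-model similarity to cross-model similarity; the similarity function and the exact formula are assumptions for illustration, not the paper's definition.
```python
# Illustrative separability-style score for one test instance: generations
# that are similar within each model but dissimilar across models are easier
# to tell apart, so the instance is more useful for pairwise preference
# evaluation. The formula here is an assumption, not the paper's definition.
from itertools import combinations, product
from statistics import mean
from typing import Callable, Sequence

def separability(
    gens_a: Sequence[str],
    gens_b: Sequence[str],
    sim: Callable[[str, str], float],  # e.g. an embedding-cosine or ROUGE wrapper (assumed)
) -> float:
    within = mean(
        [sim(x, y) for x, y in combinations(gens_a, 2)]
        + [sim(x, y) for x, y in combinations(gens_b, 2)]
    )
    across = mean(sim(x, y) for x, y in product(gens_a, gens_b))
    # Higher values mean the two models' outputs are easier to distinguish.
    return within - across
```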
- VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model [72.13121434085116]
We introduce VLBiasBench, a benchmark to evaluate biases in Large Vision-Language Models (LVLMs).
VLBiasBench features a dataset that covers nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status, as well as two intersectional bias categories: race x gender and race x social economic status.
We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models.
arXiv Detail & Related papers (2024-06-20T10:56:59Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Analyzing Social Biases in Japanese Large Language Models [24.351580958043595]
We construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ.
We analyze social biases in Japanese Large Language Models (LLMs).
Prompts with warnings about social biases and Chain-of-Thought prompting reduce the effect of biases in model outputs.
arXiv Detail & Related papers (2024-06-04T07:31:06Z)
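As a hypothetical illustration of the two mitigations mentioned above (a bias warning plus Chain-of-Thought prompting), the sketch below wraps a multiple-choice question with both; the wording is an assumption, not the prompt used in the JBBQ paper.
```python
# Hypothetical prompt wrapper combining a social-bias warning with a
# Chain-of-Thought instruction for a multiple-choice question.
def build_debias_prompt(question: str, choices: list[str]) -> str:
    warning = (
        "Note: avoid relying on social stereotypes; answer only from the "
        "information given in the question."
    )
    options = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(choices))
    cot = "Let's think step by step before choosing an answer."
    return f"{warning}\n\n{question}\n{options}\n\n{cot}"
```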
- Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [49.3814117521631]
Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between social attributes implied in user prompts and short responses.
We develop analogous RUTEd evaluations from three contexts of real-world use.
We find that standard bias metrics have no significant correlation with the more realistic bias metrics.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
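The correlation claim above can be probed with a simple check like the following, which computes a rank correlation between per-model scores from a standard bias metric and a more realistic one; the score lists are placeholders, and only the statistical step mirrors the entry.
```python
# Hypothetical check of whether a standard bias metric tracks a more
# realistic, use-grounded one: compute a rank correlation across models.
from scipy.stats import spearmanr

standard_scores = [0.12, 0.30, 0.08, 0.25, 0.19]   # placeholder per-model scores from a standard benchmark
realistic_scores = [0.40, 0.10, 0.33, 0.22, 0.28]  # placeholder scores from a use-grounded evaluation

rho, p_value = spearmanr(standard_scores, realistic_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A rho near zero with a large p-value would be consistent with the
# "no significant correlation" finding reported above.
```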
- GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
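As a generic sketch of LLM-assisted bias assessment in the spirit of this entry (not the GPTBIAS framework itself), the snippet below asks a judge model to label a response; the prompt wording and the `call_llm` helper are assumptions.
```python
# Generic sketch of using an LLM as a bias judge; the prompt, judge model,
# and `call_llm` helper are assumptions, not GPTBIAS's actual components.
from typing import Callable

JUDGE_TEMPLATE = (
    "You are evaluating another model's response for social bias.\n"
    "Response: {response}\n"
    "Does the response rely on or reinforce a social stereotype? "
    "Answer 'biased' or 'unbiased' and briefly explain which group is affected."
)

def judge_bias(response: str, call_llm: Callable[[str], str]) -> str:
    """Return the judge model's verdict for a single response."""
    return call_llm(JUDGE_TEMPLATE.format(response=response))
```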
- The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks [75.58692290694452]
We compare social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye.
We observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models.
arXiv Detail & Related papers (2022-10-18T17:58:39Z)
- BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation [89.41378346080603]
This work presents the first systematic study of social bias in PLM-based metrics.
We demonstrate that popular PLM-based metrics exhibit significantly higher social bias than traditional metrics on 6 sensitive attributes.
In addition, we develop debiasing adapters that are injected into PLM layers, mitigating bias in PLM-based metrics while retaining high performance for evaluating text generation.
arXiv Detail & Related papers (2022-10-14T08:24:11Z)
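The debiasing-adapter idea mentioned in this entry can be pictured with the following minimal bottleneck-adapter module; the architecture details (bottleneck size, residual connection, GELU activation) are generic assumptions rather than the paper's exact configuration.
```python
# Minimal sketch of a bottleneck adapter that could be injected into a
# pretrained LM layer for debiasing; the configuration here is a generic
# assumption, not the paper's exact setup.
import torch
import torch.nn as nn

class DebiasAdapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen PLM representation intact;
        # only the small adapter is trained on a debiasing objective.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```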
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences arising from its use.