Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
- URL: http://arxiv.org/abs/2304.02828v1
- Date: Thu, 6 Apr 2023 02:33:51 GMT
- Title: Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
- Authors: Noa Garcia, Yusuke Hirota, Yankun Wu, Yuta Nakashima
- Abstract summary: Even small but manually annotated datasets, such as MSCOCO, are affected by societal bias.
Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models.
Our second contribution is to conduct a comprehensive analysis of the annotations, focusing on how different demographic groups are represented.
Our third contribution is to evaluate three prevailing vision-and-language tasks, showing that societal bias is a persistent problem in all of them.
- Score: 21.421722941901123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing tendency to collect large and uncurated datasets to train
vision-and-language models has raised concerns about fair representations. It
is known that even small but manually annotated datasets, such as MSCOCO, are
affected by societal bias. This problem, far from being solved, may be getting
worse with data crawled from the Internet without much control. In addition,
the lack of tools to analyze societal bias in big collections of images makes
addressing the problem extremely challenging. Our first contribution is to
annotate part of the Google Conceptual Captions dataset, widely used for
training vision-and-language models, with four demographic and two contextual
attributes. Our second contribution is to conduct a comprehensive analysis of
the annotations, focusing on how different demographic groups are represented.
Our last contribution lies in evaluating three prevailing vision-and-language
tasks: image captioning, text-image CLIP embeddings, and text-to-image
generation, showing that societal bias is a persistent problem in all of them.
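The second of these evaluations, probing text-image CLIP embeddings, can be illustrated with a short association probe. The sketch below is not the paper's protocol: the `openai/clip-vit-base-patch32` checkpoint, the image paths, the group names, and the probe phrases are all placeholder assumptions; it only shows the general pattern of comparing how strongly images of different groups score against a fixed set of descriptive phrases.

```python
# Minimal CLIP association probe (illustrative sketch, not the paper's exact protocol).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image paths grouped by an annotated demographic attribute.
groups = {
    "group_a": ["imgs/a_01.jpg", "imgs/a_02.jpg"],
    "group_b": ["imgs/b_01.jpg", "imgs/b_02.jpg"],
}
phrases = ["a photo of a doctor", "a photo of a nurse", "a photo of a criminal"]

for name, paths in groups.items():
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(text=phrases, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_images, num_phrases); average over the group.
    mean_scores = out.logits_per_image.softmax(dim=-1).mean(dim=0)
    print(name, {p: round(s.item(), 3) for p, s in zip(phrases, mean_scores)})
```

Systematic gaps between the per-group score distributions are the kind of association the paper's analysis measures at dataset scale.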
Related papers
- Identifying Implicit Social Biases in Vision-Language Models [34.53206726136747]
We conduct a systematic analysis of the social biases that are present in vision-language models.
We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups.
Our findings highlight the importance of evaluating and addressing bias in vision-language models.
arXiv Detail & Related papers (2024-11-01T19:41:28Z)
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that take a single image of an individual as input and ground the generation process on it, together with text describing the desired visual context.
We introduce a standardized dataset (Stellar) of personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and provides rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and sets a new state of the art, both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives [69.36723767339001]
We propose a novel framework named GPT4SGG to obtain more accurate and comprehensive scene graph signals.
We show GPT4SGG significantly improves the performance of SGG models trained on image-caption data.
arXiv Detail & Related papers (2023-12-07T14:11:00Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents using Not Safe For Work (NSFW) scores computed from images alone does not exclude all the harmful content present in the alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- Probing Intersectional Biases in Vision-Language Models with Counterfactual Examples [5.870913541790421]
We employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale.
Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs.
We conduct extensive experiments using our generated dataset which reveal the intersectional social biases present in state-of-the-art VLMs.
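The counterfactual pairs can be approximated without the paper's cross-attention control by fixing the initial noise and swapping only the attribute word in the prompt; this is a simplification, and the model checkpoint, prompt template, and attribute pair below are illustrative assumptions rather than the paper's setup.

```python
# Simplified counterfactual image-pair generation (sketch; the paper additionally
# applies cross-attention control, which this fixed-seed approximation omits).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

template = "a photo of a {} person working as a software engineer"  # hypothetical prompt
attributes = ["young", "elderly"]  # hypothetical attribute pair to contrast

images = {}
for attr in attributes:
    # Re-seeding before each call keeps the initial latents identical, so the two
    # images differ (mostly) only in the swapped attribute.
    generator = torch.Generator(device="cuda").manual_seed(1234)
    images[attr] = pipe(template.format(attr), generator=generator).images[0]

for attr, img in images.items():
    img.save(f"counterfactual_{attr}.png")
```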
arXiv Detail & Related papers (2023-10-04T17:25:10Z)
- Social Biases through the Text-to-Image Generation Lens [9.137275391251517]
Text-to-Image (T2I) generation is enabling new applications that support creators, designers, and general end users of productivity software.
We take a multi-dimensional approach to studying and quantifying common social biases as reflected in the generated images.
We present findings for two popular T2I models: DALLE-v2 and Stable Diffusion.
arXiv Detail & Related papers (2023-03-30T05:29:13Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Assessing Demographic Bias Transfer from Dataset to Model: A Case Study in Facial Expression Recognition [1.5340540198612824]
Two of the proposed metrics focus on the representational and stereotypical bias of the dataset, and the third on the residual bias of the trained model.
We demonstrate the usefulness of the metrics by applying them to a FER problem based on the popular Affectnet dataset.
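The metric definitions themselves are in the paper; as a rough stand-in for the representational component, one can score how far a dataset's demographic-group distribution deviates from uniform, for example via a normalized-entropy deficit (an assumption for illustration, not the paper's formula).

```python
# Rough representational-bias style score (illustrative, not the paper's definition):
# 1 - normalized entropy of the demographic-group distribution.
import math
from collections import Counter

def representational_bias(group_labels):
    counts = Counter(group_labels)
    n, k = len(group_labels), len(counts)
    if k <= 1:
        return 1.0  # only one group present: maximally unbalanced
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(k)  # 0 = perfectly balanced, 1 = single group

# Hypothetical per-image annotations of a sensitive attribute.
print(representational_bias(["a"] * 800 + ["b"] * 150 + ["c"] * 50))  # ~0.44
```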
arXiv Detail & Related papers (2022-05-20T09:40:42Z)
- A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning [55.96577490779591]
Vision-language models can encode societal biases and stereotypes.
There are challenges to measuring and mitigating these multimodal harms.
We investigate bias measures and apply ranking metrics for image-text representations.
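Ranking-based bias measures of this kind typically compare the demographic composition of the top-K retrieved images against a reference distribution. The sketch below computes a MaxSkew@K-style score; the exact metrics used in the paper may be defined and normalized differently.

```python
# MaxSkew@K-style score for a retrieval ranking (illustrative sketch).
import math
from collections import Counter

def max_skew_at_k(ranked_group_labels, k, reference=None):
    """ranked_group_labels: group label of each retrieved image, best match first."""
    top_k = ranked_group_labels[:k]
    counts = Counter(top_k)
    groups = set(ranked_group_labels)
    if reference is None:  # default reference: groups appear uniformly
        reference = {g: 1.0 / len(groups) for g in groups}
    eps = 1e-9
    # 0 means the top-K composition matches the reference; larger means more skew.
    return max(
        math.log((counts.get(g, 0) / k + eps) / (reference[g] + eps)) for g in groups
    )

# Hypothetical ranking returned by an image-text model for a neutral query.
print(max_skew_at_k(["m", "m", "m", "f", "m", "m", "f", "m"], k=8))  # ~0.41
```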
arXiv Detail & Related papers (2022-03-22T17:59:04Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
arXiv Detail & Related papers (2021-10-05T11:47:27Z)
- REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)