Multimodal datasets: misogyny, pornography, and malignant stereotypes
- URL: http://arxiv.org/abs/2110.01963v1
- Date: Tue, 5 Oct 2021 11:47:27 GMT
- Title: Multimodal datasets: misogyny, pornography, and malignant stereotypes
- Authors: Abeba Birhane, Vinay Uday Prabhu and Emmanuel Kahembwe
- Abstract summary: We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit image-and-text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
- Score: 2.8682942808330703
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We have now entered the era of trillion parameter machine learning models
trained on billion-sized datasets scraped from the internet. The rise of these
gargantuan datasets has given rise to formidable bodies of critical work that
have called for caution when generating these large datasets. These works address
concerns surrounding the dubious curation practices used to generate these
datasets, the sordid quality of alt-text data available on the world wide web,
the problematic content of the CommonCrawl dataset often used as a source for
training large language models, and the entrenched biases in large-scale
visio-linguistic models (such as OpenAI's CLIP model) trained on opaque
datasets (WebImageText). Against the backdrop of these specific calls for caution, we
examine the recently released LAION-400M dataset, which is a CLIP-filtered
dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset. We found
that the dataset contains troublesome and explicit images and text pairs of
rape, pornography, malign stereotypes, racist and ethnic slurs, and other
extremely problematic content. We outline numerous implications, concerns and
downstream harms regarding the current state of large scale datasets while
raising open questions for various stakeholders including the AI community,
regulators, policy makers and data subjects.
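The "CLIP-filtered" construction referenced above amounts to keeping an image/alt-text pair only when the CLIP embeddings of the image and its alt-text are sufficiently similar. Below is a minimal sketch of that filtering step; the checkpoint and the 0.3 threshold are illustrative assumptions, not necessarily the exact pipeline used to build LAION-400M.

```python
# Minimal sketch of CLIP-similarity filtering for image/alt-text pairs.
# Checkpoint and threshold are illustrative, not LAION's exact settings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, alt_text: str, threshold: float = 0.3) -> bool:
    """Keep an image/alt-text pair only if CLIP cosine similarity clears the threshold."""
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() >= threshold
```

Note that a similarity filter of this kind only checks image-text agreement; it does not screen for harmful content, which is the gap the abstract highlights.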
Related papers
- RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
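As an illustration of how per-document quality signals can be consumed downstream, the sketch below filters raw web documents on a few signal values; the signal names and thresholds are assumptions for the example, not RedPajama-V2's actual schema or recommended cutoffs.

```python
# Hypothetical filtering of web documents by attached quality signals.
# Signal names and thresholds are illustrative, not RedPajama-V2's schema.
def passes_quality_filters(doc: dict) -> bool:
    signals = doc["quality_signals"]
    return (
        signals.get("word_count", 0) >= 50                  # drop near-empty pages
        and signals.get("duplicate_fraction", 1.0) <= 0.3   # drop heavily duplicated text
        and signals.get("lang_score", 0.0) >= 0.8           # keep confidently identified language
    )

corpus = [
    {"text": "a long, well-formed article ...",
     "quality_signals": {"word_count": 840, "duplicate_fraction": 0.05, "lang_score": 0.97}},
    {"text": "menu login cart menu login cart",
     "quality_signals": {"word_count": 6, "duplicate_fraction": 0.9, "lang_score": 0.99}},
]
kept = [d for d in corpus if passes_quality_filters(d)]
print(len(kept))  # -> 1
```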
arXiv Detail & Related papers (2024-11-19T09:35:28Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
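A rough sketch of the label-verification step, assuming a generic multimodal LLM endpoint; the `query_multimodal_llm` helper and prompt wording are hypothetical placeholders, not the paper's actual interface.

```python
# Sketch of multimodal-LLM label verification for a candidate visual entity.
# `query_multimodal_llm` is a hypothetical stand-in for an actual model endpoint;
# the prompt format is illustrative only.
def query_multimodal_llm(prompt: str, image_path: str) -> str:
    # Placeholder: wire this up to whichever multimodal LLM is available.
    return "yes - the landmark's silhouette matches the entity name"

def verify_entity_label(image_path: str, candidate_entity: str) -> bool:
    prompt = (
        f"Does this image depict the entity '{candidate_entity}'? "
        "Answer 'yes' or 'no', then give a one-sentence rationale."
    )
    answer = query_multimodal_llm(prompt, image_path)
    return answer.strip().lower().startswith("yes")

print(verify_entity_label("example.jpg", "Eiffel Tower"))  # True with the stub above
```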
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents using Not Safe For Work (NSFW) scores computed from images alone does not exclude all of the harmful content in the alt-text.
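The point about image-only NSFW scores can be illustrated with a toy metadata table: filtering rows on an image-derived score never inspects the alt-text. Column names and the keyword check below are assumptions for the sketch, not the audit's actual method.

```python
# Toy illustration: an image-derived NSFW filter leaves alt-text unexamined.
import pandas as pd

rows = pd.DataFrame({
    "URL": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "TEXT": ["a family photo at the beach", "alt-text containing an ethnic slur"],
    "punsafe": [0.02, 0.01],  # image-derived unsafe probability (column name assumed)
})

# Image-based filter: both rows pass, because the score says nothing about the text.
image_filtered = rows[rows["punsafe"] < 0.5]

# Screening the alt-text is a separate, additional step (here a crude keyword check).
blocklist = ["slur", "rape"]
text_flagged = image_filtered["TEXT"].str.contains("|".join(blocklist), case=False)
print(image_filtered[~text_flagged])
```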
arXiv Detail & Related papers (2023-11-06T19:00:05Z) - The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing
& Attribution in AI [41.32981860191232]
We convene legal and machine learning experts to systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We find frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
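The kind of check behind those numbers can be illustrated with a toy comparison of licenses declared on a hosting site versus licenses recorded at the source; the records below are made-up placeholders, not figures from the audit.

```python
# Toy illustration of checking declared licenses against source licenses.
# Records are fabricated placeholders for the example only.
records = [
    {"dataset": "dataset_a", "hub_license": None,         "source_license": "CC-BY-4.0"},
    {"dataset": "dataset_b", "hub_license": "MIT",        "source_license": "CC-BY-NC-4.0"},
    {"dataset": "dataset_c", "hub_license": "Apache-2.0", "source_license": "Apache-2.0"},
]

omitted = [r for r in records if r["hub_license"] is None]
mismatched = [r for r in records
              if r["hub_license"] and r["hub_license"] != r["source_license"]]
print(f"omitted: {len(omitted)}/{len(records)}, mismatched: {len(mismatched)}/{len(records)}")
```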
arXiv Detail & Related papers (2023-10-25T17:20:26Z) - Diversify Your Vision Datasets with Automatic Diffusion-Based
Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
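The filtering step described above can be sketched as a confidence test by a classifier trained on the original data; the thresholds and keep/discard criteria here are assumptions for illustration, not ALIA's exact rules.

```python
# Sketch of confidence-based filtering of generated augmentations.
# Thresholds are illustrative; ALIA's actual filtering criteria may differ.
import torch

@torch.no_grad()
def filter_generated_images(classifier: torch.nn.Module,
                            images: torch.Tensor,         # (N, C, H, W) generated edits
                            target_labels: torch.Tensor,  # (N,) intended class ids
                            low: float = 0.5,
                            high: float = 0.99) -> torch.Tensor:
    """Return a boolean mask of generated edits to keep.

    Discards edits the original-data classifier no longer assigns to the intended
    class (class information corrupted) and edits it recognizes with near-perfect
    confidence (likely a minimal, uninformative change).
    """
    probs = torch.softmax(classifier(images), dim=-1)
    p_target = probs[torch.arange(images.shape[0]), target_labels]
    return (p_target >= low) & (p_target <= high)
```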
arXiv Detail & Related papers (2023-05-25T17:43:05Z) - Uncurated Image-Text Datasets: Shedding Light on Demographic Bias [21.421722941901123]
Even small but manually annotated datasets, such as MSCOCO, are affected by societal bias.
Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models.
Our second contribution is to conduct a comprehensive analysis of the annotations, focusing on how different demographic groups are represented.
Our third contribution is to evaluate three prevailing vision-and-language tasks, showing that societal bias is a persistent problem in all of them.
arXiv Detail & Related papers (2023-04-06T02:33:51Z) - The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z) - On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML).
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than ImageNet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z) - REGRAD: A Large-Scale Relational Grasp Dataset for Safe and
Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)