A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety
- URL: http://arxiv.org/abs/2503.00020v1
- Date: Sun, 23 Feb 2025 00:59:04 GMT
- Title: A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety
- Authors: Rakeen Rouf, Trupti Bavalatti, Osama Ahmed, Dhaval Potdar, Faraz Jawed,
- Abstract summary: Novel research aimed at text-to-image (T2I) generative AI safety often relies on publicly available datasets for training and evaluation.<n>This paper presents a review of the key datasets used in the T2I research, detailing their collection methods, compositions, semantic and syntactic diversity of prompts and the quality, coverage, and distribution of harm types in the datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Novel research aimed at text-to-image (T2I) generative AI safety often relies on publicly available datasets for training and evaluation, making the quality and composition of these datasets crucial. This paper presents a comprehensive review of the key datasets used in the T2I research, detailing their collection methods, compositions, semantic and syntactic diversity of prompts and the quality, coverage, and distribution of harm types in the datasets. By highlighting the strengths and limitations of the datasets, this study enables researchers to find the most relevant datasets for a use case, critically assess the downstream impacts of their work given the dataset distribution, particularly regarding model safety and ethical considerations, and also identify the gaps in dataset coverage and quality that future research may address.
Related papers
- Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation [66.66243874361103]
dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data.
We propose Concept-Aware LoRA, a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts for domain alignment.
We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain settings.
arXiv Detail & Related papers (2025-03-28T06:23:29Z) - Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.<n>We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.<n>At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z) - DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI [24.349800949355465]
Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets.<n>We propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction.
arXiv Detail & Related papers (2024-12-09T08:47:05Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.<n>We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.<n>Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a textbflack of representation for certain protected attributes in both data and evaluations; (2) the widespread textbf of minorities during data preprocessing; and (3) textbfopaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z) - Navigating Dataset Documentations in AI: A Large-Scale Analysis of
Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset documentation on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z) - Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z) - Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study
on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT.
To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.