TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation
- URL: http://arxiv.org/abs/2505.17841v1
- Date: Fri, 23 May 2025 12:55:33 GMT
- Title: TEDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation
- Authors: Wiebke Hutiri, Mircea Cimpoi, Morgan Scheuerman, Victoria Matthews, Alice Xiang
- Abstract summary: We introduce TEDI, which encompasses 143 indicators that characterize trustworthy and ethical attributes of multimodal datasets. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We find that only a select few datasets have documented attributes and practices pertaining to consent, privacy, and harmful content indicators.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset transparency is a key enabler of responsible AI, but insights into multimodal dataset attributes that impact trustworthy and ethical aspects of AI applications remain scarce and are difficult to compare across datasets. To address this challenge, we introduce Trustworthy and Ethical Dataset Indicators (TEDI) that facilitate the systematic, empirical analysis of dataset documentation. TEDI encompasses 143 fine-grained indicators that characterize trustworthy and ethical attributes of multimodal datasets and their collection processes. The indicators are framed to extract verifiable information from dataset documentation. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We further annotated data sourcing, size, and modality details to gain insights into the factors that shape trustworthy and ethical dimensions across datasets. We find that only a select few datasets have documented attributes and practices pertaining to consent, privacy, and harmful content indicators. The extent to which these and other ethical indicators are addressed varies based on the data collection method, with documentation of datasets collected via crowdsourced and direct collection approaches being more likely to mention them. Scraping dominates scale at the cost of ethical indicators, but is not the only viable collection method. Our approach and empirical insights contribute to increasing dataset transparency along trustworthy and ethical dimensions and pave the way for automating the tedious task of extracting information from dataset documentation in the future.
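The analysis the abstract describes, tallying which trustworthy and ethical indicators each dataset's documentation covers and comparing coverage by collection method, can be sketched as follows. This is a minimal illustration: the indicator names, dataset records, and boolean annotations below are invented placeholders, not the actual TEDI taxonomy or the paper's annotations.

```python
from collections import defaultdict

# Hypothetical annotation records: each dataset gets a collection
# method and a mapping of indicator -> documented-or-not.
datasets = [
    {"name": "VoiceCorpusA", "collection": "scraped",
     "indicators": {"consent_documented": False, "privacy_policy": False}},
    {"name": "VoiceCorpusB", "collection": "crowdsourced",
     "indicators": {"consent_documented": True, "privacy_policy": True}},
    {"name": "VoiceCorpusC", "collection": "direct",
     "indicators": {"consent_documented": True, "privacy_policy": False}},
]

def coverage_by_method(records):
    """Fraction of indicators documented, grouped by collection method."""
    totals = defaultdict(lambda: [0, 0])  # method -> [documented, total]
    for rec in records:
        for documented in rec["indicators"].values():
            totals[rec["collection"]][0] += int(documented)
            totals[rec["collection"]][1] += 1
    return {m: doc / tot for m, (doc, tot) in totals.items()}

print(coverage_by_method(datasets))
# e.g. {'scraped': 0.0, 'crowdsourced': 1.0, 'direct': 0.5}
```

With the toy records above, scraped data shows the lowest indicator coverage and crowdsourced the highest, mirroring the direction of the paper's empirical finding.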
Related papers
- DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI [24.349800949355465]
Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets. We propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction.
arXiv Detail & Related papers (2024-12-09T08:47:05Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts [0.0]
This paper introduces the MERIT dataset, a fully labeled dataset within the context of school reports.
By its nature, the MERIT dataset can potentially include biases in a controlled way, making it a valuable tool for benchmarking biases induced in Large Language Models (LLMs).
To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models.
arXiv Detail & Related papers (2024-08-31T12:56:38Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Revisiting Table Detection Datasets for Visually Rich Documents [17.846536373106268]
This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge a larger dataset, termed Open-Tables.
To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets.
Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
arXiv Detail & Related papers (2023-05-04T01:08:15Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.