A Standardized Machine-readable Dataset Documentation Format for Responsible AI
- URL: http://arxiv.org/abs/2407.16883v1
- Date: Tue, 4 Jun 2024 16:40:14 GMT
- Title: A Standardized Machine-readable Dataset Documentation Format for Responsible AI
- Authors: Nitisha Jain, Mubashara Akhtar, Joan Giner-Miguelez, Rajat Shinde, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Yuhan Rao, Tim Santos, Luis Oala, Michalis Karamousadakis, Manil Maskey, Pierre Marcenac, Costanza Conforti, Michael Kuchnik, Lora Aroyo, Omar Benjelloun, Elena Simperl
- Abstract summary: Croissant-RAI is a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets.
It is integrated into major data search engines, repositories, and machine learning frameworks.
- Score: 8.59437843168878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is critical to advancing AI technologies, yet its quality and documentation remain significant challenges, leading to adverse downstream effects (e.g., potential biases) in AI applications. This paper addresses these issues by introducing Croissant-RAI, a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets. Croissant-RAI extends the Croissant metadata format and builds upon existing responsible AI (RAI) documentation frameworks, offering a standardized set of attributes and practices to facilitate community-wide adoption. Leveraging established web-publishing practices, such as Schema.org, Croissant-RAI enables dataset users to easily find and utilize RAI metadata regardless of the platform on which the datasets are published. Furthermore, it is seamlessly integrated into major data search engines, repositories, and machine learning frameworks, streamlining the reading and writing of responsible AI metadata within practitioners' existing workflows. Croissant-RAI was developed through a community-led effort. It has been designed to be adaptable to evolving documentation requirements and is supported by a Python library and a visual editor.
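As a rough illustration of what a Schema.org-based Croissant-RAI record could look like, the Python sketch below assembles a small JSON-LD dictionary with a few dataset-level fields and RAI-flavoured attributes. The property names and the rai namespace IRI are illustrative assumptions rather than quotes from the specification; in practice, the Python library and visual editor mentioned in the abstract would read and write such records.
```python
# Illustrative sketch of a Croissant-RAI-style JSON-LD record.
# The RAI property names used here (rai:dataCollection, rai:dataBiases,
# rai:personalSensitiveInformation, rai:annotationsPerItem) and the namespace
# IRIs are assumptions for illustration; consult the Croissant-RAI
# specification for the actual vocabulary.
import json

croissant_rai_metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",       # assumed namespace IRI
        "rai": "http://mlcommons.org/croissant/RAI/",  # assumed namespace IRI
    },
    "@type": "Dataset",
    "name": "example-dataset",
    "description": "A toy dataset used to illustrate RAI metadata fields.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Lab"},
    # Responsible-AI attributes layered on top of the core Croissant fields.
    "rai:dataCollection": "Collected via a voluntary opt-in web survey in 2023.",
    "rai:dataBiases": "Respondents skew toward English-speaking users.",
    "rai:personalSensitiveInformation": "No direct identifiers; ages are bucketed.",
    "rai:annotationsPerItem": 3,
}

# Because the metadata is plain JSON-LD published alongside the dataset,
# search engines and ML frameworks can read it with ordinary JSON tooling.
print(json.dumps(croissant_rai_metadata, indent=2))
```
Publishing the record as JSON-LD is what makes the responsible-AI attributes discoverable by web crawlers and data search engines without any platform-specific integration.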
Related papers
- DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights [8.139817615390147]
This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework.
DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling.
arXiv Detail & Related papers (2024-10-02T14:47:55Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments [9.125552623625806]
This paper introduces a no-code, machine-readable documentation framework for open datasets.
The framework aims to improve the comprehensibility and usability of open datasets.
The framework is expected to enhance the quality and reliability of data used in research and decision-making.
arXiv Detail & Related papers (2023-12-11T06:41:14Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Data Efficient Training of a U-Net Based Architecture for Structured Documents Localization [0.0]
We propose SDL-Net: a novel U-Net like encoder-decoder architecture for the localization of structured documents.
Our approach allows pre-training the encoder of SDL-Net on a generic dataset containing samples of various document classes.
arXiv Detail & Related papers (2023-10-02T07:05:19Z)
- Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
The typical data-science experimentation workflow does not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports these requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.