Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments
- URL: http://arxiv.org/abs/2312.06153v2
- Date: Thu, 28 Mar 2024 02:20:36 GMT
- Title: Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments
- Authors: Anthony Cintron Roman, Jennifer Wortman Vaughan, Valerie See, Steph Ballard, Jehu Torres, Caleb Robinson, Juan M. Lavista Ferres,
- Abstract summary: This paper introduces a no-code, machine-readable documentation framework for open datasets.
The framework aims to improve comprehensibility, and usability of open datasets.
The framework is expected to enhance the quality and reliability of data used in research and decision-making.
- Score: 9.125552623625806
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper introduces a no-code, machine-readable documentation framework for open datasets, with a focus on responsible AI (RAI) considerations. The framework aims to improve comprehensibility, and usability of open datasets, facilitating easier discovery and use, better understanding of content and context, and evaluation of dataset quality and accuracy. The proposed framework is designed to streamline the evaluation of datasets, helping researchers, data scientists, and other open data users quickly identify datasets that meet their needs and organizational policies or regulations. The paper also discusses the implementation of the framework and provides recommendations to maximize its potential. The framework is expected to enhance the quality and reliability of data used in research and decision-making, fostering the development of more responsible and trustworthy AI systems.
Related papers
- Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators [0.5755004576310334]
We interviewed 18 leading dataset creators about the current state of the field.
We shed light on the challenges and considerations faced by dataset creators.
We share seven central recommendations for improving responsible dataset creation.
arXiv Detail & Related papers (2024-08-30T20:52:19Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - A Standardized Machine-readable Dataset Documentation Format for Responsible AI [8.59437843168878]
Croissant-RAI is a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets.
It is integrated into major data search engines, repositories, and machine learning frameworks.
arXiv Detail & Related papers (2024-06-04T16:40:14Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - AUGUST: an Automatic Generation Understudy for Synthesizing
Conversational Recommendation Datasets [56.052803235932686]
We propose a novel automatic dataset synthesis approach that can generate both large-scale and high-quality recommendation dialogues.
In doing so, we exploit: (i) rich personalized user profiles from traditional recommendation datasets, (ii) rich external knowledge from knowledge graphs, and (iii) the conversation ability contained in human-to-human conversational recommendation datasets.
arXiv Detail & Related papers (2023-06-16T05:27:14Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z) - Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z) - Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity)
arXiv Detail & Related papers (2022-02-03T17:25:46Z) - CateCom: a practical data-centric approach to categorization of
computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z) - Towards Accountability for Machine Learning Datasets: Practices from
Software Engineering and Infrastructure [9.825840279544465]
datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation.
This paper introduces a rigorous framework for dataset development transparency which supports decision-making and accountability.
arXiv Detail & Related papers (2020-10-23T01:57:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.