Related papers: Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

URL: http://arxiv.org/abs/2206.02923v1
Date: Mon, 6 Jun 2022 21:55:39 GMT
Title: Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata
Authors: Amy Heger, Elizabeth B. Marquis, Mihaela Vorvoreanu, Hanna Wallach, Jennifer Wortman Vaughan
Abstract summary: Data is central to the development and evaluation of machine learning (ML) models. To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation. There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
Score: 10.689661834716613
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from datasheets for datasets (Gebru, 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Despite the fact that data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks.

Related papers

Data Requirement Goal Modeling for Machine Learning Systems [0.8854624631197942]
This work proposes an approach to guide non-experts in identifying data requirements for Machine Learning systems. We first develop the Data Requirement Goal Model (DRGM) by surveying the white literature. We then validate the approach through two illustrative examples based on real-world projects.
arXiv Detail & Related papers (2025-04-10T11:30:25Z)
Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research [0.0]
We extend the framework to "Datasheets for AI and medical datasets - DAIMS". Our publicly available solution, DAIMS, provides a checklist including data standardization requirements. The checklist consists of 24 common data standardization requirements, of which the tool checks and validates a subset.
arXiv Detail & Related papers (2025-01-23T21:02:56Z)
Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the usage of Knowledge Graphs (KG) as a basic framework for capturing complex analytics in a human-centered manner. The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z)
Synthetic Data Generation with Large Language Models for Personalized Community Question Answering [47.300506002171275]
We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Our findings suggest that LLMs have high potential in generating data tailored to users' needs. The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
arXiv Detail & Related papers (2024-10-29T16:19:08Z)
Data Formulator 2: Iteratively Creating Rich Visualizations with AI [65.48447317310442]
We present Data Formulator 2, an LLM-powered visualization system to address these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformations are delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don't need to start from scratch every time.
arXiv Detail & Related papers (2024-08-28T20:12:17Z)
A Standardized Machine-readable Dataset Documentation Format for Responsible AI [8.59437843168878]
Croissant-RAI is a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets. It is integrated into major data search engines, repositories, and machine learning frameworks.
arXiv Detail & Related papers (2024-06-04T16:40:14Z)
Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices. We find that researchers in machine learning, which often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z)
The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements. LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset documentation cards on Hugging Face. Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets. We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems. Previous research has proposed standardized checklists to document datasets. This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z)
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders. We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps. Our dataset is collected in both forms of 2D images and 3D point clouds. Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.