Towards Accountability for Machine Learning Datasets: Practices from
Software Engineering and Infrastructure
- URL: http://arxiv.org/abs/2010.13561v2
- Date: Sat, 30 Jan 2021 00:12:54 GMT
- Title: Towards Accountability for Machine Learning Datasets: Practices from
Software Engineering and Infrastructure
- Authors: Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina
Greer, Oddur Kjartansson, Parker Barnes, Margaret Mitchell
- Abstract summary: Datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation.
This paper introduces a rigorous framework for dataset development transparency which supports decision-making and accountability.
- Score: 9.825840279544465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rising concern for the societal implications of artificial intelligence
systems has inspired demands for greater transparency and accountability.
However, the datasets which empower machine learning are often used, shared and
re-used with little visibility into the processes of deliberation which led to
their creation. Which stakeholder groups had their perspectives included when
the dataset was conceived? Which domain experts were consulted regarding how to
model subgroups and other phenomena? How were questions of representational
biases measured and addressed? Who labeled the data? In this paper, we
introduce a rigorous framework for dataset development transparency which
supports decision-making and accountability. The framework uses the cyclical,
infrastructural and engineering nature of dataset development to draw on best
practices from the software development lifecycle. Each stage of the data
development lifecycle yields a set of documents that facilitate improved
communication and decision-making, as well as drawing attention to the value and
necessity of careful data work. The proposed framework is intended to
contribute to closing the accountability gap in artificial intelligence
systems, by making visible the often overlooked work that goes into dataset
creation.
Related papers
- Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators [0.5755004576310334]
We interviewed 18 leading dataset creators about the current state of the field.
We shed light on the challenges and considerations faced by dataset creators.
We share seven central recommendations for improving responsible dataset creation.
arXiv Detail & Related papers (2024-08-30T20:52:19Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- A survey on datasets for fairness-aware machine learning [6.962333053044713]
A large variety of fairness-aware machine learning solutions have been proposed.
In this paper, we overview real-world datasets used for fairness-aware machine learning.
For a deeper understanding of bias and fairness in the datasets, we investigate relationships within the data using exploratory analysis.
arXiv Detail & Related papers (2021-10-01T16:54:04Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Bringing the People Back In: Contesting Benchmark Machine Learning Datasets [11.00769651520502]
We outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created.
We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets.
arXiv Detail & Related papers (2020-07-14T23:22:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.