Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection
- URL: http://arxiv.org/abs/2308.12885v2
- Date: Wed, 27 Sep 2023 14:03:30 GMT
- Title: Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection
- Authors: Oana Inel, Tim Draws and Lora Aroyo
- Abstract summary: We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
- Score: 8.12993269922936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid entry of machine learning approaches in our daily activities and
high-stakes domains demands transparency and scrutiny of their fairness and
reliability. To help gauge machine learning models' robustness, research
typically focuses on the massive datasets used for their deployment, e.g.,
creating and maintaining documentation for understanding their origin, process
of development, and ethical considerations. However, data collection for AI is
still typically a one-off practice, and oftentimes datasets collected for a
certain purpose or application are reused for a different problem.
Additionally, dataset annotations may not be representative over time, contain
ambiguous or erroneous annotations, or be unable to generalize across issues or
domains. Recent research has shown these practices might lead to unfair,
biased, or inaccurate outcomes. We argue that data collection for AI should be
performed in a responsible manner where the quality of the data is thoroughly
scrutinized and measured through a systematic set of appropriate metrics. In
this paper, we propose a Responsible AI (RAI) methodology designed to guide the
data collection with a set of metrics for an iterative in-depth analysis of the
factors influencing the quality and reliability of the generated data. We
propose a granular set of measurements to inform on the internal reliability of
a dataset and its external stability over time. We validate our approach across
nine existing datasets and annotation tasks and four content modalities. This
approach impacts the assessment of data robustness used for AI applied in the
real world, where diversity of users and content is prominent. Furthermore, it
deals with fairness and accountability aspects in data collection by providing
systematic and transparent quality analysis for data collections.
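
To make these measurements concrete, here is a minimal, illustrative sketch (not the authors' implementation; the function names and the data layout are assumptions for this example) of two such metrics: internal reliability as mean pairwise inter-annotator agreement within one collection round, and external stability as the fraction of items whose majority-vote label is unchanged between rounds collected at different times.

```python
# Illustrative sketch of two kinds of measurements the RAI methodology calls
# for: internal reliability (agreement within one annotation round) and
# external stability (label drift between rounds). All names and the data
# layout are assumptions, not the paper's implementation.
from collections import Counter
from itertools import combinations

def pairwise_agreement(annotations):
    """Mean fraction of item-level label matches over all annotator pairs.

    `annotations` maps annotator id -> {item id: label}.
    """
    scores = []
    for a, b in combinations(annotations.values(), 2):
        shared = set(a) & set(b)  # items both annotators labeled
        if shared:
            scores.append(sum(a[i] == b[i] for i in shared) / len(shared))
    return sum(scores) / len(scores) if scores else float("nan")

def majority_labels(annotations):
    """Majority-vote label per item for one collection round."""
    votes = {}
    for labels in annotations.values():
        for item, label in labels.items():
            votes.setdefault(item, []).append(label)
    return {item: Counter(ls).most_common(1)[0][0] for item, ls in votes.items()}

def stability(round_a, round_b):
    """Fraction of items whose majority label is unchanged across two rounds."""
    ma, mb = majority_labels(round_a), majority_labels(round_b)
    shared = set(ma) & set(mb)
    return sum(ma[i] == mb[i] for i in shared) / len(shared)

# Toy usage: two annotation rounds over the same three items.
round_1 = {"ann1": {"x": "pos", "y": "neg", "z": "pos"},
           "ann2": {"x": "pos", "y": "neg", "z": "neg"}}
round_2 = {"ann3": {"x": "pos", "y": "pos", "z": "pos"},
           "ann4": {"x": "pos", "y": "pos", "z": "pos"}}
print(f"internal reliability (round 1): {pairwise_agreement(round_1):.2f}")
print(f"external stability (round 1 vs 2): {stability(round_1, round_2):.2f}")
```

In practice one would substitute a chance-corrected statistic such as Krippendorff's alpha for raw agreement; the toy functions above only illustrate the internal/external split the abstract describes.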
Related papers
- Towards Explainable Automated Data Quality Enhancement without Domain Knowledge [0.0]
We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset.
Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence (see the illustrative sketch after this list).
We adopt a hybrid approach that integrates statistical methods with machine learning algorithms.
arXiv Detail & Related papers (2024-09-16T10:08:05Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Data AUDIT: Identifying Attribute Utility- and Detectability-Induced Bias in Task Models [8.420252576694583]
We present a first technique for the rigorous, quantitative screening of medical image datasets.
Our method decomposes the risks associated with dataset attributes in terms of their detectability and utility.
We show that our screening method reliably identifies nearly imperceptible bias-inducing artifacts.
arXiv Detail & Related papers (2023-04-06T16:50:15Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Towards Handling Uncertainty-at-Source in AI -- A Review and Next Steps for Interval Regression [6.166295570030645]
This paper focuses on linear regression for interval-valued data as a recent growth area.
We conduct an in-depth analysis of state-of-the-art methods, elucidating their behaviour, advantages, and pitfalls when applied to datasets with different properties.
arXiv Detail & Related papers (2021-04-15T05:31:10Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
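
As flagged in the first related paper above, data-quality screening is often framed around absence, redundancy, and incoherence. The toy sketch below (an illustration under assumptions, not that paper's framework; the function name and thresholds are invented here) shows one simple way to screen a tabular dataset for those three defect types.

```python
# Toy data-quality screen for the three defect types named above: absence
# (missing cells), redundancy (duplicate rows), and incoherence (numeric
# values far from the column's typical range). Illustrative only.
import pandas as pd

def screen_quality(df: pd.DataFrame, z_thresh: float = 3.0) -> dict:
    num = df.select_dtypes("number")
    # Robust (median/MAD-based) z-scores, so a single extreme value cannot
    # mask itself by inflating the standard deviation.
    # (Assumes MAD > 0; constant columns would need special-casing.)
    med = num.median()
    mad = (num - med).abs().median()
    robust_z = (num - med).abs() / (1.4826 * mad)
    return {
        "absence": df.isna().mean().to_dict(),      # share of missing cells per column
        "redundancy": int(df.duplicated().sum()),   # count of fully duplicated rows
        "incoherence": robust_z.gt(z_thresh).sum().to_dict(),  # outlier cells per column
    }

# Toy usage: one duplicated row, one missing age, one wildly incoherent age.
df = pd.DataFrame({
    "age": [25, 25, 26, 27, 28, 29, 30, 31, None, 300],
    "city": ["NY", "NY", "LA", "SF", "LA", "NY", "SF", "LA", "NY", "SF"],
})
print(screen_quality(df))
```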
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.