Access Denied: Meaningful Data Access for Quantitative Algorithm Audits
- URL: http://arxiv.org/abs/2502.00428v1
- Date: Sat, 01 Feb 2025 13:33:45 GMT
- Title: Access Denied: Meaningful Data Access for Quantitative Algorithm Audits
- Authors: Juliette Zaccour, Reuben Binns, Luc Rocher
- Abstract summary: Third-party audits are often hindered by access restrictions, forcing auditors to rely on limited, low-quality data.
We conduct audit simulations on two realistic case studies for recidivism and healthcare coverage prediction.
We find that data minimization and anonymization practices can strongly increase error rates on individual-level data, leading to unreliable assessments.
- Abstract: Independent algorithm audits hold the promise of bringing accountability to automated decision-making. However, third-party audits are often hindered by access restrictions, forcing auditors to rely on limited, low-quality data. To study how these limitations impact research integrity, we conduct audit simulations on two realistic case studies for recidivism and healthcare coverage prediction. We examine the accuracy of estimating group parity metrics across three levels of access: (a) aggregated statistics, (b) individual-level data with model outputs, and (c) individual-level data without model outputs. Despite selecting one of the simplest tasks for algorithmic auditing, we find that data minimization and anonymization practices can strongly increase error rates on individual-level data, leading to unreliable assessments. We discuss implications for independent auditors, as well as potential avenues for HCI researchers and regulators to improve data access and enable both reliable and holistic evaluations.
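The group parity metrics examined in the abstract can be illustrated with a minimal sketch. The function below computes the demographic parity difference (the gap in positive-prediction rates between two groups) from individual-level data with model outputs, i.e. access level (b). The data and column layout here are hypothetical, not taken from the paper's case studies.

```python
# Minimal sketch: estimating a group parity metric (demographic parity
# difference) from individual-level data with model outputs.
# Group labels and predictions below are hypothetical illustrations.

def demographic_parity_difference(groups, predictions):
    """Difference in positive-prediction rates between two groups."""
    rates = {}
    for g in set(groups):
        preds = [p for grp, p in zip(groups, predictions) if grp == g]
        rates[g] = sum(preds) / len(preds)
    a, b = sorted(rates)
    return rates[a] - rates[b]

groups = ["A", "A", "A", "B", "B", "B"]
predictions = [1, 1, 0, 1, 0, 0]  # model outputs (1 = positive decision)
print(demographic_parity_difference(groups, predictions))  # ≈ 0.33 (rate gap)
```

With only aggregated statistics (access level a), the auditor would receive the per-group rates directly; without model outputs (level c), the predictions would first have to be reconstructed, which is where the paper finds error rates grow.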
Related papers
- Anomaly Detection in Double-entry Bookkeeping Data by Federated Learning System with Non-model Sharing Approach [3.827294988616478]
Anomaly detection is crucial in financial auditing and effective detection often requires obtaining large volumes of data from multiple organizations.
In this study, we propose a novel framework employing Data Collaboration (DC) analysis to streamline model training into a single communication round.
Our findings represent a significant advance in artificial intelligence-driven auditing and underscore the potential of FL methods in high-security domains.
arXiv Detail & Related papers (2025-01-22T08:53:12Z)
- Towards Explainable Automated Data Quality Enhancement without Domain Knowledge [0.0]
We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset.
Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence.
We adopt a hybrid approach that integrates statistical methods with machine learning algorithms.
arXiv Detail & Related papers (2024-09-16T10:08:05Z)
- The Data Minimization Principle in Machine Learning [61.17813282782266]
Data minimization aims to reduce the amount of data collected, processed or retained.
It has been endorsed by various global data protection regulations.
However, its practical implementation remains a challenge due to the lack of a rigorous formulation.
arXiv Detail & Related papers (2024-05-29T19:40:27Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- CEDAR: Communication Efficient Distributed Analysis for Regressions [9.50726756006467]
There is growing interest in distributed learning over multiple EHR databases without sharing patient-level data.
We propose a novel communication efficient method that aggregates the local optimal estimates, by turning the problem into a missing data problem.
We provide theoretical investigation for the properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses.
arXiv Detail & Related papers (2022-07-01T09:53:44Z)
- Uncertainty Minimization for Personalized Federated Semi-Supervised Learning [15.123493340717303]
We propose a novel semi-supervised learning paradigm that allows partially labeled or unlabeled clients to seek labeling assistance from data-related clients (helper agents).
Experiments show that our proposed method can obtain superior performance and more stable convergence than other related works with partially labeled data.
arXiv Detail & Related papers (2022-05-05T04:41:27Z)
- Multi-view Contrastive Self-Supervised Learning of Accounting Data Representations for Downstream Audit Tasks [1.9659095632676094]
International audit standards require the direct assessment of a financial statement's underlying accounting transactions, referred to as journal entries.
Deep-learning-inspired audit techniques have emerged for analyzing vast quantities of journal entry data.
We propose a contrastive self-supervised learning framework designed to learn audit task invariant accounting data representations.
arXiv Detail & Related papers (2021-09-23T08:16:31Z)
- Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice [62.44110411199835]
We build on literature in machine learning and law to propose a framework for limiting data collection, based on an interpretation that ties collected data to system performance.
We formalize a data minimization criterion based on performance curve derivatives and provide an effective and interpretable piecewise power law technique.
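The derivative-based criterion can be sketched as follows: model the performance curve with a power law, then stop collecting data once the marginal gain per additional sample falls below a threshold. The curve shape, constants, and threshold below are hypothetical illustrations, not the paper's fitted values or its actual piecewise technique.

```python
# Illustrative sketch of a derivative-based data minimization criterion:
# assume performance follows a power law acc(n) = a - b * n**(-c), and
# stop collecting once the marginal gain per sample drops below a
# threshold. All constants here are hypothetical.

def marginal_gain(n, a=0.95, b=0.8, c=0.5):
    """Derivative of acc(n) = a - b * n**(-c) with respect to n."""
    return b * c * n ** (-c - 1)

def minimal_dataset_size(threshold=1e-4, n_max=10**7):
    """Smallest n (found by doubling) whose marginal gain is below threshold."""
    n = 1
    while n < n_max and marginal_gain(n) >= threshold:
        n *= 2
    return n

print(minimal_dataset_size())  # stops once each extra sample adds < threshold
```

In practice the performance curve would be fitted from pilot data rather than assumed, but the stopping logic is the same: collection is justified only while the derivative of the performance curve stays above the chosen threshold.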
arXiv Detail & Related papers (2021-07-16T19:59:01Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- Operationalizing the Legal Principle of Data Minimization for Personalization [64.0027026050706]
We identify a lack of a homogeneous interpretation of the data minimization principle and explore two operational definitions applicable in the context of personalization.
We find that the performance decrease incurred by data minimization might not be substantial, but it might disparately impact different users.
arXiv Detail & Related papers (2020-05-28T00:43:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.