Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
- URL: http://arxiv.org/abs/2404.12691v1
- Date: Fri, 19 Apr 2024 07:42:35 GMT
- Title: Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?
- Authors: Shayne Longpre, Robert Mahari, Naana Obeng-Marnu, William Brannon, Tobin South, Katy Gero, Sandy Pentland, Jad Kabbara,
- Abstract summary: New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections.
Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement.
- Score: 11.040101172803727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.
Related papers
- The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources [100.23208165760114]
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications.
To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet.
arXiv Detail & Related papers (2024-06-24T15:55:49Z) - Trustless Audits without Revealing Data or Models [49.23322187919369]
We show that it is possible to allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties.
We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights.
arXiv Detail & Related papers (2024-04-06T04:43:06Z) - Generative Models are Self-Watermarked: Declaring Model Authentication
through Re-Generation [17.88043926057354]
verifying data ownership poses formidable challenges, particularly in cases of unauthorized reuse of generated data.
Our work is dedicated to detecting data reuse from even an individual sample.
We propose an explainable verification procedure that attributes data ownership through re-generation, and further amplifies these fingerprints in the generative models through iterative data re-generation.
arXiv Detail & Related papers (2024-02-23T10:48:21Z) - The Foundation Model Transparency Index [55.862805799199194]
The Foundation Model Transparency Index specifies 100 indicators that codify transparency for foundation models.
We score developers in relation to their practices for their flagship foundation model.
Overall, the Index establishes the level of transparency today to drive progress on foundation model governance.
arXiv Detail & Related papers (2023-10-19T17:39:02Z) - Collect, Measure, Repeat: Reliability Factors for Responsible AI Data
Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z) - Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensures fidelity to the source data, assesses utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z) - On the Opportunities and Risks of Foundation Models [256.61956234436553]
We call these models foundation models to underscore their critically central yet incomplete character.
This report provides a thorough account of the opportunities and risks of foundation models.
To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration.
arXiv Detail & Related papers (2021-08-16T17:50:08Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.