Related papers: Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

URL: http://arxiv.org/abs/2304.10819v4
Date: Sun, 9 Jun 2024 18:40:20 GMT
Title: Auditing and Generating Synthetic Data with Controllable Trust Trade-offs
Authors: Brian Belgodere, Pierre Dognin, Adam Ivankay, Igor Melnyk, Youssef Mroueh, Aleksandra Mojsilovic, Jiri Navratil, Apoorva Nitsure, Inkit Padhi, Mattia Rigotti, Jerret Ross, Yair Schiff, Radhika Vedpathak, Richard A. Young,
Abstract summary: We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensures fidelity to the source data, assesses utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
Score: 54.262044436203965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues. This paradigm relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensures fidelity to the source data, assesses utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguards trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with "TrustFormers" across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.

Related papers

Beyond Internal Data: Constructing Complete Datasets for Fairness Testing [26.037607208689977]
This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible.<n>We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information.<n>We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data.
arXiv Detail & Related papers (2025-07-24T16:35:42Z)
Rethinking Data Protection in the (Generative) Artificial Intelligence Era [115.71019708491386]
We propose a four-level taxonomy that captures the diverse protection needs arising in modern (generative) AI models and systems.<n>Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline.
arXiv Detail & Related papers (2025-07-03T02:45:51Z)
Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs.<n>Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z)
Quantitative Auditing of AI Fairness with Differentially Private Synthetic Data [0.30693357740321775]
Fairness auditing of AI systems can identify and quantify biases. Traditional auditing using real-world data raises security and privacy concerns. We propose a framework that leverages differentially private synthetic data to audit the fairness of AI systems.
arXiv Detail & Related papers (2025-04-30T13:36:27Z)
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [333.9220561243189]
Generative Foundation Models (GenFMs) have emerged as transformative tools. Their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv Detail & Related papers (2025-02-20T06:20:36Z)
Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice [0.3069335774032178]
It is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models.
arXiv Detail & Related papers (2024-11-19T12:19:28Z)
Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data. Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z)
Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data [13.139215811928931]
This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria. Our findings validate that this framework provides reliable and scalable evaluation for synthetic retail data.
arXiv Detail & Related papers (2024-06-19T00:47:38Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Distributed Machine Learning and the Semblance of Trust [66.1227776348216]
Federated Learning (FL) allows the data owner to maintain data governance and perform model training locally without having to share their data. FL and related techniques are often described as privacy-preserving. We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind.
arXiv Detail & Related papers (2021-12-21T08:44:05Z)
A Privacy-Preserving and Trustable Multi-agent Learning Framework [34.28936739262812]
This paper presents Privacy-preserving and trustable Distributed Learning (PT-DL) PT-DL is a fully decentralized framework that relies on Differential Privacy to guarantee strong privacy protections of the agents' data. The paper shows that PT-DL is resilient up to a 50% collusion attack, with high probability, in a malicious trust model.
arXiv Detail & Related papers (2021-06-02T15:46:27Z)
Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process. We generate a representative as well as fair version of the UCI Adult census data set. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
Trustworthy Transparency by Design [57.67333075002697]
We propose a transparency framework for software design, incorporating research on user trust and experience. Our framework enables developing software that incorporates transparency in its design.
arXiv Detail & Related papers (2021-03-19T12:34:01Z)
Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, ability to explain the decisions, address the bias in their training data, are some of the most prominent limitations. We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
Really Useful Synthetic Data -- A Framework to Evaluate the Quality of Differentially Private Synthetic Data [2.538209532048867]
Recent advances in generating synthetic data that allow to add principled ways of protecting privacy are a crucial step in sharing statistical information in a privacy preserving way. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. We develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective.
arXiv Detail & Related papers (2020-04-16T16:24:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.