It's My Data Too: Private ML for Datasets with Multi-User Training Examples
- URL: http://arxiv.org/abs/2503.03622v1
- Date: Wed, 05 Mar 2025 16:02:09 GMT
- Title: It's My Data Too: Private ML for Datasets with Multi-User Training Examples
- Authors: Arun Ganesh, Ryan McKenna, Brendan McMahan, Adam Smith, Fan Wu
- Abstract summary: We first provide a carefully chosen definition of user-level DP under the multi-attribution model. We propose a greedy baseline algorithm for the contribution bounding problem. We study variants of this baseline algorithm that optimize the subset chosen using different techniques and criteria.
- Score: 9.18252846535411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We initiate a study of algorithms for model training with user-level differential privacy (DP), where each example may be attributed to multiple users, which we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. Training in the multi-attribution model is facilitated by solving the contribution bounding problem, i.e. the problem of selecting a subset of the dataset for which each user is associated with a limited number of examples. We propose a greedy baseline algorithm for the contribution bounding problem. We then empirically study this algorithm for a synthetic logistic regression task and a transformer training task, including studying variants of this baseline algorithm that optimize the subset chosen using different techniques and criteria. We find that the baseline algorithm remains competitive with its variants in most settings, and build a better understanding of the practical importance of a bias-variance tradeoff inherent in solutions to the contribution bounding problem.
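The contribution bounding problem admits a simple greedy sketch. Below is a minimal Python illustration, assuming each example is attributed to a set of user IDs and at most `user_cap` selected examples may be attributed to any single user; the function name, the cap parameter, and the fewest-users-first ordering heuristic are illustrative assumptions, not necessarily the paper's exact baseline or its variants.

```python
from collections import defaultdict

def greedy_contribution_bounding(examples, user_cap):
    """Select a subset of examples so that no user ends up attributed
    to more than `user_cap` selected examples.

    `examples` is a list of (example_id, user_ids) pairs, where user_ids
    is the set of users attributed to that example. A minimal greedy
    sketch, not necessarily the paper's exact baseline.
    """
    counts = defaultdict(int)  # selected examples per user so far
    selected = []
    # Heuristic ordering (an assumption): consider examples attributed to
    # fewer users first, since they consume less budget across users.
    for ex_id, users in sorted(examples, key=lambda e: len(e[1])):
        if all(counts[u] < user_cap for u in users):
            selected.append(ex_id)
            for u in users:
                counts[u] += 1
    return selected

# Usage: five examples attributed to one or two users, cap of 2 per user.
examples = [
    ("a", {1}), ("b", {1, 2}), ("c", {2, 3}), ("d", {1, 3}), ("e", {3}),
]
print(greedy_contribution_bounding(examples, user_cap=2))  # ['a', 'e', 'b', 'c']
```

The bias-variance tradeoff mentioned in the abstract arises, roughly, because a tighter cap permits less DP noise per user but discards more examples and can bias the retained dataset.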
Related papers
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
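The mixing step in the entry above is easy to illustrate. Below is a hedged numpy sketch that synthesizes minority-class samples by convex interpolation between a minority and a majority sample; the function name, the mixing-weight range, and the one-shot (rather than iterative) formulation are illustrative assumptions, and the paper's actual scheme and label handling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_minority_majority(X_min, X_maj, n_new, alpha_low=0.6, alpha_high=0.95):
    """Synthesize minority-class samples by convex mixing.

    Each synthetic sample is alpha * minority + (1 - alpha) * majority,
    with alpha kept high so the result stays close to the minority class.
    An illustrative sketch, not the paper's exact procedure.
    """
    idx_min = rng.integers(0, len(X_min), size=n_new)
    idx_maj = rng.integers(0, len(X_maj), size=n_new)
    alphas = rng.uniform(alpha_low, alpha_high, size=(n_new, 1))
    return alphas * X_min[idx_min] + (1 - alphas) * X_maj[idx_maj]

# Usage: 10 minority and 200 majority points in 2-D; add 50 synthetic points.
X_min = rng.normal(loc=2.0, size=(10, 2))
X_maj = rng.normal(loc=0.0, size=(200, 2))
X_new = mix_minority_majority(X_min, X_maj, n_new=50)
print(X_new.shape)  # (50, 2)
```

Keeping alpha close to 1 is one simple way to keep synthetic points near the minority region while still borrowing variation from the majority class.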
- Factorization of Multi-Agent Sampling-Based Motion Planning [72.42734061131569]
Modern robotics often involves multiple embodied agents operating within a shared environment.
Standard sampling-based algorithms can be used to search for solutions in the robots' joint space.
We integrate the concept of factorization into sampling-based algorithms, which requires only minimal modifications to existing methods.
We present a general implementation of a factorized SBA, derive an analytical gain in terms of sample complexity for PRM*, and showcase empirical results for RRG.
arXiv Detail & Related papers (2023-04-01T15:50:18Z)
- Probabilistic Bilevel Coreset Selection [24.874967723659022]
We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample.
We develop an efficient solver for the bilevel optimization problem via an unbiased policy gradient, avoiding the trouble of implicit differentiation.
arXiv Detail & Related papers (2023-01-24T09:37:00Z)
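The policy-gradient idea in the entry above can be sketched on a toy problem: learn a Bernoulli inclusion probability per training sample, solve the inner problem on each sampled subset, and update the probabilities with a score-function (REINFORCE-style) estimate of the outer loss gradient, which avoids implicit differentiation. Everything below (the toy data, the closed-form inner solver, the step sizes, the moving-average baseline) is an illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: estimate a scalar by least squares; two outliers are worth excluding.
y = np.concatenate([rng.normal(1.0, 0.1, size=20), np.array([8.0, -7.0])])
n = len(y)
val = np.arange(20)  # the outer loss is evaluated on the clean points only

def inner_fit(mask):
    """Inner problem: least-squares fit (the mean) on the selected subset."""
    return y[mask].mean() if mask.any() else 0.0

def outer_loss(theta):
    """Outer objective: squared error on the validation points."""
    return np.mean((y[val] - theta) ** 2)

logits = np.zeros(n)  # per-sample inclusion logits; p_i = sigmoid(logit_i)
baseline, lr = 0.0, 0.2
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-logits))
    mask = rng.random(n) < p                 # sample a subset m ~ Bernoulli(p)
    L = outer_loss(inner_fit(mask))          # outer loss acts as the "reward"
    # Score-function (REINFORCE) gradient: d log P(mask) / d logits = mask - p.
    logits -= lr * (L - baseline) * (mask.astype(float) - p)
    baseline = 0.9 * baseline + 0.1 * L      # moving-average variance reduction

p = 1.0 / (1.0 + np.exp(-logits))
print(p[:20].round(2))  # clean points: inclusion probabilities should drift high
print(p[20:].round(2))  # outliers: inclusion probabilities should drift low
```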
- Multi-Task Learning for Sparsity Pattern Heterogeneity: Statistical and Computational Perspectives [10.514866749547558]
We consider a problem in Multi-Task Learning (MTL) where multiple linear models are jointly trained on a collection of datasets.
A key novelty of our framework is that it allows the sparsity pattern of regression coefficients and the values of non-zero coefficients to differ across tasks.
Our methods allow models to share information across tasks by separately encouraging (1) coefficient supports and/or (2) nonzero coefficient values to be similar.
This allows models to borrow strength during variable selection even when non-zero coefficient values differ across tasks.
arXiv Detail & Related papers (2022-12-16T19:52:25Z)
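One way to make the "supports and/or values" idea above concrete is a generic objective that fits task-specific coefficients while separately penalizing disagreement in supports and in nonzero values; the notation below is illustrative, not taken from the paper:

$$
\min_{\beta_1,\dots,\beta_T}\; \sum_{t=1}^{T} L_t(\beta_t)
\;+\; \lambda_{\mathrm{supp}} \sum_{t < t'} d\big(\operatorname{supp}(\beta_t), \operatorname{supp}(\beta_{t'})\big)
\;+\; \lambda_{\mathrm{val}} \sum_{t < t'} \lVert \beta_t - \beta_{t'} \rVert_2^2
$$

Setting $\lambda_{\mathrm{supp}}$ or $\lambda_{\mathrm{val}}$ to zero recovers the "and/or": tasks can share supports, values, or both.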
- Domain Adaptation Principal Component Analysis: base linear method for learning with out-of-distribution data [55.41644538483948]
Domain adaptation is a popular paradigm in modern machine learning.
We present a method called Domain Adaptation Principal Component Analysis (DAPCA).
DAPCA finds a linear reduced data representation useful for solving the domain adaptation task.
arXiv Detail & Related papers (2022-08-28T21:10:56Z)
- Pareto Set Learning for Neural Multi-objective Combinatorial Optimization [6.091096843566857]
Multi-objective combinatorial optimization (MOCO) problems can be found in many real-world applications.
We develop a learning-based approach to approximate the whole Pareto set for a given MOCO problem without any further search procedure.
Our proposed method significantly outperforms other approaches on the multiobjective traveling salesman, vehicle routing, and knapsack problems in terms of solution quality, speed, and model efficiency.
arXiv Detail & Related papers (2022-03-29T09:26:22Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
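The constrained formulation above admits a natural Lagrangian. In generic notation (illustrative, not the paper's exact formulation), with $L_u$ the primary training objective, $\ell$ a per-sample loss, $\epsilon$ a performance bound, and $\lambda_i \ge 0$ the dual variables:

$$
\min_{\theta}\; \max_{\lambda \ge 0}\; L_u(\theta) \;+\; \sum_{i \in \text{labeled}} \lambda_i \big( \ell(\theta; x_i, y_i) - \epsilon \big)
$$

Dual variables that settle at large values flag samples whose constraints are hard to satisfy, which is one way a duality view can guide sample selection.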
- Rank-Based Multi-task Learning for Fair Regression [9.95899391250129]
We develop a novel learning approach for multi-task regression models based on a biased dataset.
Our approach builds on a popular rank-based non-parametric independence test.
arXiv Detail & Related papers (2020-09-23T22:32:57Z)
- Continual Learning using a Bayesian Nonparametric Dictionary of Weight Factors [75.58555462743585]
Naively trained neural networks tend to experience catastrophic forgetting in sequential task settings.
We propose a principled nonparametric approach based on the Indian Buffet Process (IBP) prior, letting the data determine how much to expand the model complexity.
We demonstrate the effectiveness of our method on a number of continual learning benchmarks and analyze how weight factors are allocated and reused throughout the training.
arXiv Detail & Related papers (2020-04-21T15:20:19Z)
- Learn to Expect the Unexpected: Probably Approximately Correct Domain Generalization [38.345670899258515]
Domain generalization is the problem of machine learning when the training data and the test data come from different data domains.
We present a simple theoretical model of learning to generalize across domains in which there is a meta-distribution over data distributions.
arXiv Detail & Related papers (2020-02-13T17:37:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.