Learning from Aggregated Data: Curated Bags versus Random Bags
- URL: http://arxiv.org/abs/2305.09557v2
- Date: Thu, 18 May 2023 17:13:26 GMT
- Title: Learning from Aggregated Data: Curated Bags versus Random Bags
- Authors: Lin Chen, Gang Fu, Amin Karbasi, Vahab Mirrokni
- Abstract summary: We explore the possibility of training machine learning models with aggregated data labels, rather than individual labels.
For the curated bag setting, we show that we can perform gradient-based learning without any degradation in performance.
In the random bag setting, our bound indicates a trade-off between the bag size and the achievable error rate.
- Score: 35.394402088653415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protecting user privacy is a major concern for many machine learning systems
that are deployed at scale and collect data from a diverse population. One
way to address this concern is by collecting and releasing data labels in an
aggregated manner so that the information about a single user is potentially
combined with others. In this paper, we explore the possibility of training
machine learning models with aggregated data labels, rather than individual
labels. Specifically, we consider two natural aggregation procedures suggested
by practitioners: curated bags where the data points are grouped based on
common features, and random bags where the data points are grouped randomly
into bags of similar sizes. For the curated bag setting and for a broad range of loss
functions, we show that we can perform gradient-based learning without any
degradation in performance that may result from aggregating data. Our method is
based on the observation that the sum of the gradients of the loss function on
individual data examples in a curated bag can be computed from the aggregate
label without the need for individual labels. For the random bag setting, we
provide a generalization risk bound based on the Rademacher complexity of the
hypothesis class and show how empirical risk minimization can be regularized to
achieve the smallest risk bound. In fact, in the random bag setting, there is a
trade-off between the size of the bag and the achievable error rate, as our bound
indicates. Finally, we conduct a careful empirical study to confirm our
theoretical findings. In particular, our results suggest that aggregate
learning can be an effective method for preserving user privacy while
maintaining model accuracy.
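
To make the curated-bag observation concrete, the following is a minimal numeric sketch (not the authors' code), under the simplifying assumption that every example in a curated bag shares the same feature vector, which is one concrete way that grouping by common features can arise. For a loss whose gradient is linear in the label, such as the logistic loss used here, the gradient summed over the bag then depends on the individual labels only through their sum, i.e. the aggregate bag label.

import numpy as np

# Hedged sketch: assumes all examples in a curated bag share feature vector x_bag.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_bag = rng.normal(size=3)            # shared features of the bag (assumption)
y_bag = rng.integers(0, 2, size=5)    # individual binary labels (normally unavailable)
w = rng.normal(size=3)                # current model parameters

# Gradient of the summed logistic loss computed from individual labels:
# each per-example gradient is (sigmoid(w.x) - y) * x, which is linear in y.
grad_individual = sum((sigmoid(w @ x_bag) - y) * x_bag for y in y_bag)

# The same gradient computed from only the bag size and the aggregate label sum(y).
k, y_agg = len(y_bag), y_bag.sum()
grad_aggregate = (k * sigmoid(w @ x_bag) - y_agg) * x_bag

assert np.allclose(grad_individual, grad_aggregate)

The two gradients coincide, so a gradient step over the bag can be taken without ever seeing the individual labels; only the aggregate label and the bag size are needed under the stated assumption.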
Related papers
- Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
- Weak to Strong Learning from Aggregate Labels [9.804335415337071]
We study the problem of using a weak learner on such training bags with aggregate labels to obtain a strong learner.
A weak learner attains only a constant accuracy strictly below 1 on the training bags, while a strong learner's accuracy can be made arbitrarily close to 1.
Our work is the first to theoretically study weak to strong learning from aggregate labels, with an algorithm to achieve the same for LLP.
arXiv Detail & Related papers (2024-11-09T14:56:09Z)
- Theoretical Proportion Label Perturbation for Learning from Label Proportions in Large Bags [5.842419815638353]
Learning from label proportions (LLP) is a weakly supervised learning approach that trains an instance-level classifier from the label proportions of bags.
A challenge in LLP arises when the number of instances in a bag (the bag size) is large, making traditional LLP methods difficult to apply due to GPU memory limitations.
This study aims to develop an LLP method capable of learning from bags with large sizes.
arXiv Detail & Related papers (2024-08-26T09:24:36Z)
- Learning from Aggregate responses: Instance Level versus Bag Level Loss Functions [23.32422115080128]
In many practical applications the training data is aggregated before being shared with the learner, in order to protect privacy of users' sensitive responses.
We study two natural loss functions for learning from aggregate responses: the bag-level loss and the instance-level loss.
We propose a mechanism for differentially private learning from aggregate responses and derive the optimal bag size in terms of prediction risk-privacy trade-off.
arXiv Detail & Related papers (2024-01-20T02:14:11Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Learning from Multiple Unlabeled Datasets with Partial Risk Regularization [80.54710259664698]
In this paper, we aim to learn an accurate classifier without any class labels.
We first derive an unbiased estimator of the classification risk that can be estimated from the given unlabeled sets.
We then find that the classifier obtained this way tends to overfit, as its empirical risk goes negative during training.
Experiments demonstrate that our method effectively mitigates overfitting and outperforms state-of-the-art methods for learning from multiple unlabeled sets.
arXiv Detail & Related papers (2022-07-04T16:22:44Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels can wrongly direct neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- Fast learning from label proportions with small bags [0.0]
In learning from label proportions (LLP), the instances are grouped into bags, and the task is to learn an instance classifier given relative class proportions in training bags.
In this work, we focus on the case of small bags, which allows designing more efficient algorithms by explicitly considering all consistent label combinations.
arXiv Detail & Related papers (2021-10-07T13:11:18Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Certainty Pooling for Multiple Instance Learning [0.6299766708197883]
We present a novel pooling operator called Certainty Pooling, which incorporates model certainty into bag predictions.
Our method outperforms other methods in both bag level and instance level prediction, especially when only small training sets are available.
arXiv Detail & Related papers (2020-08-24T16:38:46Z)