Learning from Aggregated Data: Curated Bags versus Random Bags
- URL: http://arxiv.org/abs/2305.09557v2
- Date: Thu, 18 May 2023 17:13:26 GMT
- Title: Learning from Aggregated Data: Curated Bags versus Random Bags
- Authors: Lin Chen, Gang Fu, Amin Karbasi, Vahab Mirrokni
- Abstract summary: We explore the possibility of training machine learning models with aggregated data labels, rather than individual labels.
For the curated bag setting, we show that we can perform gradient-based learning without any degradation in performance.
In the random bag setting, our bound indicates a trade-off between the bag size and the achievable error rate.
- Score: 35.394402088653415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protecting user privacy is a major concern for many machine learning systems
that are deployed at scale and collect data from a diverse population. One
way to address this concern is by collecting and releasing data labels in an
aggregated manner so that the information about a single user is potentially
combined with others. In this paper, we explore the possibility of training
machine learning models with aggregated data labels, rather than individual
labels. Specifically, we consider two natural aggregation procedures suggested
by practitioners: curated bags where the data points are grouped based on
common features, and random bags where the data points are grouped randomly
into bags of similar sizes. For the curated bag setting and for a broad range of loss
functions, we show that we can perform gradient-based learning without any
degradation in performance that may result from aggregating data. Our method is
based on the observation that the sum of the gradients of the loss function on
individual data examples in a curated bag can be computed from the aggregate
label without the need for individual labels. For the random bag setting, we
provide a generalization risk bound based on the Rademacher complexity of the
hypothesis class and show how empirical risk minimization can be regularized to
achieve the smallest risk bound. In fact, in the random bag setting, there is a
trade-off between the size of the bag and the achievable error rate, as our bound
indicates. Finally, we conduct a careful empirical study to confirm our
theoretical findings. In particular, our results suggest that aggregate
learning can be an effective method for preserving user privacy while
maintaining model accuracy.
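
To make the curated-bag observation concrete, the following is a minimal numeric sketch (not the authors' code), under the simplifying assumption that every example in a curated bag shares the same feature vector, which is one concrete way that grouping by common features can arise. For a loss whose gradient is linear in the label, such as the logistic loss used here, the gradient summed over the bag then depends on the individual labels only through their sum, i.e. the aggregate bag label.

import numpy as np

# Hedged sketch: assumes all examples in a curated bag share feature vector x_bag.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_bag = rng.normal(size=3)            # shared features of the bag (assumption)
y_bag = rng.integers(0, 2, size=5)    # individual binary labels (normally unavailable)
w = rng.normal(size=3)                # current model parameters

# Gradient of the summed logistic loss computed from individual labels:
# each per-example gradient is (sigmoid(w.x) - y) * x, which is linear in y.
grad_individual = sum((sigmoid(w @ x_bag) - y) * x_bag for y in y_bag)

# The same gradient computed from only the bag size and the aggregate label sum(y).
k, y_agg = len(y_bag), y_bag.sum()
grad_aggregate = (k * sigmoid(w @ x_bag) - y_agg) * x_bag

assert np.allclose(grad_individual, grad_aggregate)

The two gradients coincide, so a gradient step over the bag can be taken without ever seeing the individual labels; only the aggregate label and the bag size are needed under the stated assumption.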
Related papers
- Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
- Weak to Strong Learning from Aggregate Labels [9.804335415337071]
We study the problem of using a weak learner on such training bags with aggregate labels to obtain a strong learner.
A weak learner attains only a constant accuracy strictly below 1 on the training bags, while a strong learner's accuracy can be made arbitrarily close to 1.
Our work is the first to theoretically study weak to strong learning from aggregate labels, with an algorithm to achieve the same for LLP.
arXiv Detail & Related papers (2024-11-09T14:56:09Z)
- Theoretical Proportion Label Perturbation for Learning from Label Proportions in Large Bags [5.842419815638353]
Learning from label proportions (LLP) is a weakly supervised learning approach that trains an instance-level classifier from the label proportions of bags.
A challenge in LLP arises when the number of instances in a bag (the bag size) is large, making traditional LLP methods difficult to apply due to GPU memory limitations.
This study aims to develop an LLP method capable of learning from bags with large sizes.
arXiv Detail & Related papers (2024-08-26T09:24:36Z)
- Learning from Aggregate responses: Instance Level versus Bag Level Loss Functions [23.32422115080128]
In many practical applications the training data is aggregated before being shared with the learner, in order to protect privacy of users' sensitive responses.
We study two natural loss functions for learning from aggregate responses: the bag-level loss and the instance-level loss.
We propose a mechanism for differentially private learning from aggregate responses and derive the optimal bag size in terms of prediction risk-privacy trade-off.
arXiv Detail & Related papers (2024-01-20T02:14:11Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Learning from Multiple Unlabeled Datasets with Partial Risk Regularization [80.54710259664698]
In this paper, we aim to learn an accurate classifier without any class labels.
We first derive an unbiased estimator of the classification risk that can be estimated from the given unlabeled sets.
We then find that the classifier obtained this way tends to overfit, as its empirical risk goes negative during training.
Experiments demonstrate that our method effectively mitigates overfitting and outperforms state-of-the-art methods for learning from multiple unlabeled sets.
arXiv Detail & Related papers (2022-07-04T16:22:44Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels can wrongly direct neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- Fast learning from label proportions with small bags [0.0]
In learning from label proportions (LLP), the instances are grouped into bags, and the task is to learn an instance classifier given relative class proportions in training bags.
In this work, we focus on the case of small bags, which allows designing more efficient algorithms by explicitly considering all consistent label combinations.
arXiv Detail & Related papers (2021-10-07T13:11:18Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Certainty Pooling for Multiple Instance Learning [0.6299766708197883]
We present a novel pooling operator called Certainty Pooling, which incorporates model certainty into bag predictions.
Our method outperforms other methods in both bag level and instance level prediction, especially when only small training sets are available.
arXiv Detail & Related papers (2020-08-24T16:38:46Z)