On the Sample Complexity of Adversarial Multi-Source PAC Learning
- URL: http://arxiv.org/abs/2002.10384v2
- Date: Tue, 30 Jun 2020 14:22:51 GMT
- Title: On the Sample Complexity of Adversarial Multi-Source PAC Learning
- Authors: Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert
- Abstract summary: In a single-source setting, an adversary with the power to corrupt a fixed fraction of the training data can prevent PAC-learnability.
We show that, surprisingly, the same is not true in the multi-source setting, where the adversary can arbitrarily corrupt a fixed fraction of the data sources.
Our results also show that in a cooperative learning setting sharing data with other parties has provable benefits, even if some participants are malicious.
- Score: 46.24794665486056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning from multiple untrusted data sources, a
scenario of increasing practical relevance given the recent emergence of
crowdsourcing and collaborative learning paradigms. Specifically, we analyze
the situation in which a learning system obtains datasets from multiple
sources, some of which might be biased or even adversarially perturbed. It is
known that in the single-source case, an adversary with the power to corrupt a
fixed fraction of the training data can prevent PAC-learnability, that is, even
in the limit of infinitely much training data, no learning system can approach
the optimal test error. In this work we show that, surprisingly, the same is
not true in the multi-source setting, where the adversary can arbitrarily
corrupt a fixed fraction of the data sources. Our main results are a
generalization bound that provides finite-sample guarantees for this learning
setting, as well as corresponding lower bounds. Besides establishing
PAC-learnability our results also show that in a cooperative learning setting
sharing data with other parties has provable benefits, even if some
participants are malicious.
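The abstract's core observation is that per-source structure helps: an adversary who may corrupt a bounded fraction of *sources* is weaker than one who corrupts the same fraction of pooled *samples*. A toy sketch of this principle (not the paper's actual algorithm) is mean estimation via the median of per-source means, where corrupting a minority of sources cannot move the estimate arbitrarily:

```python
import random
import statistics

def robust_mean_estimate(sources):
    # Median of per-source means: an adversary controlling fewer than
    # half of the sources cannot move the median arbitrarily, even if
    # it corrupts its own sources without limit.
    per_source_means = [sum(s) / len(s) for s in sources]
    return statistics.median(per_source_means)

random.seed(0)
true_mean = 1.0
# ten honest sources, 100 i.i.d. samples each
sources = [[random.gauss(true_mean, 1.0) for _ in range(100)] for _ in range(10)]
# the adversary arbitrarily corrupts three additional sources
sources += [[1e6] * 100 for _ in range(3)]

estimate = robust_mean_estimate(sources)  # stays near true_mean
# naively pooling all samples is ruined by the corrupted sources
naive = sum(map(sum, sources)) / sum(map(len, sources))
```

Naive pooling corresponds to the single-source setting, where the same corruption budget destroys the estimate; the per-source median illustrates why the multi-source setting remains learnable.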
Related papers
- The Curse of Diversity in Ensemble-Based Exploration [7.209197316045156]
Training a diverse ensemble of data-sharing agents can significantly impair the performance of the individual ensemble members.
We name this phenomenon the curse of diversity.
We demonstrate the potential of representation learning to counteract the curse of diversity.
arXiv Detail & Related papers (2024-05-07T14:14:50Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- Correcting Underrepresentation and Intersectional Bias for Classification [49.1574468325115]
We consider the problem of learning from data corrupted by underrepresentation bias.
We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out rates.
We show that our algorithm permits efficient learning for model classes of finite VC dimension.
arXiv Detail & Related papers (2023-06-19T18:25:44Z)
- Investigating Multi-source Active Learning for Natural Language Inference [34.18663328309923]
We show that four popular active learning schemes fail to outperform random selection on the task of natural language inference when applied to unlabelled pools comprising multiple data sources.
We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers.
In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful.
arXiv Detail & Related papers (2023-02-14T11:10:18Z)
- Non-IID data and Continual Learning processes in Federated Learning: A long road ahead [58.720142291102135]
Federated Learning is a novel framework that allows multiple devices or institutions to train a machine learning model collaboratively while keeping their data private.
In this work, we formally classify data statistical heterogeneity and review the most remarkable learning strategies able to address it.
At the same time, we introduce approaches from other machine learning frameworks, such as Continual Learning, that also deal with data heterogeneity and could be easily adapted to the Federated Learning setting.
arXiv Detail & Related papers (2021-11-26T09:57:11Z)
- On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning [69.48387059607387]
We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning.
We analyze the limitations of learning from confounded expert data with and without external reward.
We validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.
arXiv Detail & Related papers (2021-10-13T07:31:31Z)
- FLEA: Provably Fair Multisource Learning from Unreliable Training Data [28.382147902475545]
We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress data sources that would have a negative impact on fairness or accuracy.
We demonstrate the effectiveness of our approach through a diverse range of experiments on multiple datasets.
arXiv Detail & Related papers (2021-06-22T13:09:45Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- Probabilistic Inference for Learning from Untrusted Sources [6.811310452498163]
Federated learning brings potential benefits of faster learning, better solutions, and greater transferability when heterogeneous data from different parties increases diversity.
It is important for the aggregation algorithm to be robust to non-IID data and corrupted parties.
Recent work assumes that a reference dataset is available through which to perform the identification.
We consider settings where no such reference dataset is available; rather, the quality and suitability of the parties needs to be inferred.
We propose novel federated learning aggregation algorithms based on Bayesian inference that adapt to the quality of the parties.
arXiv Detail & Related papers (2021-01-15T15:30:06Z)
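The entry above proposes inferring party quality without a reference dataset. The paper's Bayesian machinery is not reproduced here; the sketch below is only a hypothetical stand-in for the idea, down-weighting each party's update by its deviation from the coordinate-wise median of all updates:

```python
import math

def quality_weighted_aggregate(updates, temperature=1.0):
    # Hypothetical sketch: infer party "quality" from agreement alone,
    # with no reference dataset. Parties whose update lies far from the
    # coordinate-wise median of all updates receive exponentially less
    # weight in the aggregate.
    dim = len(updates[0])
    median = [sorted(u[i] for u in updates)[len(updates) // 2] for i in range(dim)]
    def dist(u):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, median)))
    weights = [math.exp(-dist(u) / temperature) for u in updates]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]

honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]   # three well-behaved parties
corrupted = [[50.0, -50.0]]                      # one corrupted party
agg = quality_weighted_aggregate(honest + corrupted)  # stays near [1.0, 2.0]
```

The corrupted party's weight collapses to essentially zero, so the aggregate tracks the honest parties; a Bayesian treatment would replace the ad-hoc exponential weighting with posterior probabilities over party quality.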
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.