Five reasons against assuming a data-generating distribution in Machine Learning
- URL: http://arxiv.org/abs/2407.17395v1
- Date: Wed, 24 Jul 2024 16:17:14 GMT
- Title: Five reasons against assuming a data-generating distribution in Machine Learning
- Authors: Benedikt Höltgen, Robert C. Williamson,
- Abstract summary: We argue that the concept of a data-generating probability distribution is not always a good model.
We suggest an alternative framework that focuses on finite populations rather than abstract distributions.
- Score: 7.09435109588801
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Learning research, as most of Statistics, heavily relies on the concept of a data-generating probability distribution. As data points are thought to be sampled from such a distribution, we can learn from observed data about this distribution and, thus, predict future data points drawn from it (with some probability of success). Drawing on scholarship across disciplines, we here argue that this framework is not always a good model. Not only do such true probability distributions not exist; the framework can also be misleading and obscure both the choices made and the goals pursued in machine learning practice. We suggest an alternative framework that focuses on finite populations rather than abstract distributions; while classical learning theory can be left almost unchanged, it opens new opportunities, especially to model sampling. We compile these considerations into five reasons for modelling machine learning -- in some settings -- with finite distributions rather than generative distributions, both to be more faithful to practice and to provide novel theoretical insights.
Related papers
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Longtailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - Ask Your Distribution Shift if Pre-Training is Right for You [74.18516460467019]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others.
We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data.
Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z) - SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [49.94607673097326]
We propose a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data.
Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization algorithm.
Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios.
arXiv Detail & Related papers (2024-02-21T03:39:04Z) - Credal Learning Theory [4.64390130376307]
We lay the foundations for a credal' theory of learning, using convex sets of probabilities to model the variability in the data-generating distribution.
Bounds are derived for the case of finite hypotheses spaces as well as infinite model spaces, which directly generalize classical results.
arXiv Detail & Related papers (2024-02-01T19:25:58Z) - Dr. FERMI: A Stochastic Distributionally Robust Fair Empirical Risk
Minimization Framework [12.734559823650887]
In the presence of distribution shifts, fair machine learning models may behave unfairly on test data.
Existing algorithms require full access to data and cannot be used when small batches are used.
This paper proposes the first distributionally robust fairness framework with convergence guarantees that do not require knowledge of the causal graph.
arXiv Detail & Related papers (2023-09-20T23:25:28Z) - End-to-End Trajectory Distribution Prediction Based on Occupancy Grid
Maps [29.67295706224478]
In this paper, we aim to forecast a future trajectory distribution of a moving agent in the real world, given the social scene images and historical trajectories.
We learn the distribution with symmetric cross-entropy using occupancy grid maps as an explicit and scene-compliant approximation to the ground-truth distribution.
In experiments, our method achieves state-of-the-art performance on the Stanford Drone dataset and Intersection Drone dataset.
arXiv Detail & Related papers (2022-03-31T09:24:32Z) - Discovering Invariant Rationales for Graph Neural Networks [104.61908788639052]
Intrinsic interpretability of graph neural networks (GNNs) is to find a small subset of the input graph's features.
We propose a new strategy of discovering invariant rationale (DIR) to construct intrinsically interpretable GNNs.
arXiv Detail & Related papers (2022-01-30T16:43:40Z) - On some theoretical limitations of Generative Adversarial Networks [77.34726150561087]
It is a general assumption that GANs can generate any probability distribution.
We provide a new result based on Extreme Value Theory showing that GANs can't generate heavy tailed distributions.
arXiv Detail & Related papers (2021-10-21T06:10:38Z) - Out of Distribution Generalization in Machine Learning [0.0]
In everyday situations when models are tested in slightly different data than they were trained on, ML algorithms can fail spectacularly.
This research attempts to formally define this problem, what sets of assumptions are reasonable to make in our data.
Then, we focus on a certain class of out of distribution problems, their assumptions, and introduce simple algorithms that follow from these assumptions.
arXiv Detail & Related papers (2021-03-03T20:35:19Z) - A Note on High-Probability versus In-Expectation Guarantees of
Generalization Bounds in Machine Learning [95.48744259567837]
Statistical machine learning theory often tries to give generalization guarantees of machine learning models.
Statements made about the performance of machine learning models have to take the sampling process into account.
We show how one may transform one statement to another.
arXiv Detail & Related papers (2020-10-06T09:41:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.