Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models
- URL: http://arxiv.org/abs/2208.13080v3
- Date: Sun, 7 Jul 2024 15:44:26 GMT
- Title: Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models
- Authors: Ethan Pickering, Themistoklis P. Sapsis,
- Abstract summary: Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models. We present a Bayesian sequential selection method, akin to Bayesian experimental design, that identifies critically important information within a dataset, while ignoring data that is either misleading or brings unnecessary complexity to the surrogate model of choice. Our method improves sample-wise error convergence and eliminates instances where more data leads to worse performance and instabilities of the surrogate model, often termed sample-wise ``double descent''. We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails. Our approach has two key features. First, the selection algorithm dynamically couples the chosen model and data. Data is chosen based on its merits towards improving the selected model, rather than being compared strictly against other data. Second, a natural convergence of the method removes the need for dividing the data into training, testing, and validation sets. Instead, the selection metric inherently assesses testing and validation error through global statistics of the model. This ensures that key information is never wasted in testing or validation. The method is applied using both Gaussian process regression and deep neural network surrogate models.
Related papers
- Data Pruning in Generative Diffusion Models [2.0111637969968]
Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets.
We show that eliminating redundant or noisy data in large datasets is beneficial particularly when done strategically.
arXiv Detail & Related papers (2024-11-19T14:13:25Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Machine Learning Data Suitability and Performance Testing Using Fault
Injection Testing Framework [0.0]
This paper presents the Fault Injection for Undesirable Learning in input Data (FIUL-Data) testing framework.
Data mutators explore vulnerabilities of ML systems against the effects of different fault injections.
This paper evaluates the framework using data from analytical chemistry, comprising retention time measurements of anti-sense oligonucleotides.
arXiv Detail & Related papers (2023-09-20T12:58:35Z) - Semi-Supervised Learning with Multiple Imputations on Non-Random Missing
Labels [0.0]
Semi-Supervised Learning (SSL) is implemented when algorithms are trained on both labeled and unlabeled data.
This paper proposes two new methods of combining multiple imputation models to achieve higher accuracy and less bias.
arXiv Detail & Related papers (2023-08-15T04:09:53Z) - Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z) - Eeny, meeny, miny, moe. How to choose data for morphological inflection [8.914777617216862]
This paper explores four sampling strategies for the task of morphological inflection using a Transformer model.
We investigate the robustness of each strategy across 30 typologically diverse languages.
Our results show a clear benefit to selecting data based on model confidence and entropy.
arXiv Detail & Related papers (2022-10-26T04:33:18Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Time Series Anomaly Detection with label-free Model Selection [0.6303112417588329]
We propose LaF-AD, a novel anomaly detection algorithm with label-free model selection for unlabeled times-series data.
Our algorithm is easily parallelizable, more robust for ill-conditioned and seasonal data, and highly scalable for a large number of anomaly models.
arXiv Detail & Related papers (2021-06-11T00:21:06Z) - Data from Model: Extracting Data from Non-robust and Robust Models [83.60161052867534]
This work explores the reverse process of generating data from a model, attempting to reveal the relationship between the data and the model.
We repeat the process of Data to Model (DtM) and Data from Model (DfM) in sequence and explore the loss of feature mapping information.
Our results show that the accuracy drop is limited even after multiple sequences of DtM and DfM, especially for robust models.
arXiv Detail & Related papers (2020-07-13T05:27:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.