On minimizing the training set fill distance in machine learning
regression
- URL: http://arxiv.org/abs/2307.10988v2
- Date: Tue, 5 Dec 2023 13:23:55 GMT
- Title: On minimizing the training set fill distance in machine learning
regression
- Authors: Paolo Climaco and Jochen Garcke
- Abstract summary: We study a data selection approach that aims to minimize the fill distance of the selected set.
We show that selecting training sets with the FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.
- Score: 0.6526824510982802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For regression tasks one often leverages large datasets for training
predictive machine learning models. However, using large datasets may not be
feasible due to computational limitations or high data labelling costs.
Therefore, suitably selecting small training sets from large pools of
unlabelled data points is essential to maximize model performance while
maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a
data selection approach that aims to minimize the fill distance of the selected
set. We derive an upper bound for the maximum expected prediction error,
conditional to the location of the unlabelled data points, that linearly
depends on the training set fill distance. For empirical validation, we perform
experiments using two regression models on three datasets. We empirically show
that selecting a training set by aiming to minimize the fill distance, thereby
minimizing our derived bound, significantly reduces the maximum prediction
error of various regression models, outperforming alternative sampling
approaches by a large margin. Furthermore, we show that selecting training sets
with the FPS can also increase model stability for the specific case of
Gaussian kernel regression approaches.
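Farthest Point Sampling as described in the abstract can be sketched concisely: greedily add the pool point farthest from the current selection, which is exactly the step that shrinks the fill distance (the largest distance from any pool point to its nearest selected point). A minimal NumPy sketch, assuming Euclidean distances and a 2-D pool array; this illustrates the general algorithm, not the authors' implementation:

```python
import numpy as np

def farthest_point_sampling(X, k, seed=0):
    """Greedy FPS: select k indices from pool X (n_points x n_features),
    each new point being the one farthest from the current selection."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(X.shape[0]))]   # arbitrary starting point
    # distance from every pool point to its nearest selected point
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                  # farthest remaining point
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

def fill_distance(X, idx):
    """Fill distance of the subset X[idx] w.r.t. the pool X: the largest
    distance from any pool point to its nearest selected point."""
    d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
    return d.min(axis=1).max()
```

Because each greedy step removes the current worst-covered point, FPS typically yields a far smaller fill distance than a random subset of the same size, which is the quantity the paper's error bound depends on linearly.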
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable via our training procedure, including the optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach to accelerate training of deep learning models.
This subset is selected based on an importance criterion measured over the entire dataset during warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
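The summary above does not specify SwiftLearn's exact importance measure, but a common choice for this kind of warm-up-based selection is per-example loss: after a few warm-up epochs, keep only the examples the model still finds hard. A minimal sketch under that assumption (the function name and `keep_frac` parameter are illustrative, not from the paper):

```python
import numpy as np

def select_important(losses, keep_frac=0.1):
    """Generic importance-based subset selection: given per-example losses
    recorded during a warm-up pass, keep the indices of the hardest
    (highest-loss) fraction of the dataset. SwiftLearn's actual criterion
    may differ; this illustrates the general mechanism only."""
    losses = np.asarray(losses)
    k = max(1, int(len(losses) * keep_frac))
    # argsort is ascending, so the last k indices are the highest-loss examples
    return np.argsort(losses)[-k:]
```

Training would then continue on `data[select_important(warmup_losses, 0.1)]`, i.e. roughly 10% of the pool, matching the "drop almost 90% of the data" regime described above.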
arXiv Detail & Related papers (2023-11-25T22:51:01Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning, without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z)
- Remember to correct the bias when using deep learning for regression! [13.452510519858992]
When training deep learning models for least-squares regression, we cannot expect that the training error residuals of the final model, selected after a fixed training time, sum to zero.
We suggest adjusting the bias of the machine learning model after training as a default postprocessing step, which efficiently solves the problem.
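The bias adjustment described above can be sketched in a few lines: shift every prediction by the mean training residual, so that the residuals of the corrected model sum to zero. This is a minimal illustration of the general idea, not the paper's exact implementation:

```python
import numpy as np

def correct_bias(predict, X_train, y_train):
    """Post-hoc bias correction for a trained regressor: compute the mean
    training residual and add it to every future prediction, so that the
    corrected training residuals sum to zero."""
    offset = np.mean(y_train - predict(X_train))
    return lambda X: predict(X) + offset
```

Wrapping any trained model's `predict` this way costs one extra pass over the training set and leaves the model's variance untouched; only the constant offset changes.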
arXiv Detail & Related papers (2022-03-30T17:09:03Z)
- Mixing Deep Learning and Multiple Criteria Optimization: An Application to Distributed Learning with Multiple Datasets [0.0]
The training phase is the most important stage of the machine learning process.
We develop a multiple criteria optimization model in which each criterion measures the distance between the output associated with a specific input and its label.
We propose a scalarization approach to implement this model and present numerical experiments on digit classification using the MNIST data.
arXiv Detail & Related papers (2021-12-02T16:00:44Z)
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To combine the strengths of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
- Training Data Subset Selection for Regression with Controlled Generalization Error [19.21682938684508]
We develop an efficient majorization-minimization algorithm for data subset selection.
SELCON trades off accuracy and efficiency more effectively than the current state-of-the-art.
arXiv Detail & Related papers (2021-06-23T16:03:55Z)
- Variational Bayesian Unlearning [54.26984662139516]
We study the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased.
We show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief.
In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging.
arXiv Detail & Related papers (2020-10-24T11:53:00Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.