Data Aggregation for Reducing Training Data in Symbolic Regression
- URL: http://arxiv.org/abs/2108.10660v1
- Date: Tue, 24 Aug 2021 11:58:17 GMT
- Title: Data Aggregation for Reducing Training Data in Symbolic Regression
- Authors: Lukas Kammerer, Gabriel Kronberger, Michael Kommenda
- Abstract summary: This work discusses methods to reduce the training data and thereby also the runtime of genetic programming.
K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method.
The performance of genetic programming is compared with random forests and linear regression.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing volume of data makes the use of computationally intense machine
learning techniques such as symbolic regression with genetic programming more
and more impractical. This work discusses methods to reduce the training data
and thereby also the runtime of genetic programming. The data is aggregated in
a preprocessing step before running the actual machine learning algorithm.
K-means clustering and data binning are used for data aggregation and compared
with random sampling as the simplest data reduction method. We analyze the
achieved speed-up in training and the effects on the trained models' test
accuracy for every method on four real-world data sets. The performance of
genetic programming is compared with random forests and linear regression. It
is shown that k-means clustering and random sampling lead to only a very small
loss in test accuracy when the data is reduced to 30% of its original size,
while the speed-up is proportional to the size of the data set. Binning, on the
contrary, leads to models with a very high test error.
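
As an illustration of the aggregation-as-preprocessing idea, the sketch below reduces a data set to 30% of its original size with k-means and with random sampling before any model is trained. It is a minimal sketch assuming scikit-learn and NumPy; using the centroids of the joint input/target space as the aggregated training points is an assumption made for illustration, not necessarily the paper's exact scheme, and the binning variant is omitted.

```python
# Minimal sketch (not the paper's reference implementation) of aggregating a
# data set before an expensive symbolic-regression / GP run.
# ASSUMPTION: k-means centroids computed in the joint (input, target) space
# serve as the aggregated training points; the paper may aggregate differently.
import numpy as np
from sklearn.cluster import KMeans


def kmeans_aggregate(X, y, fraction=0.3, random_state=0):
    """Cluster the joint (X, y) rows and return the centroids as a reduced set."""
    k = max(1, int(len(X) * fraction))
    joint = np.column_stack([X, y])
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(joint)
    centers = km.cluster_centers_
    return centers[:, :-1], centers[:, -1]          # aggregated X and y


def random_sample(X, y, fraction=0.3, random_state=0):
    """Baseline: keep a uniform random subset of the rows."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=max(1, int(len(X) * fraction)), replace=False)
    return X[idx], y[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.random((2000, 4))                        # toy data set
    y = X[:, 0] * np.sin(X[:, 1]) + 0.1 * rng.standard_normal(2000)
    X_km, y_km = kmeans_aggregate(X, y, fraction=0.3)
    X_rs, y_rs = random_sample(X, y, fraction=0.3)
    print(X_km.shape, X_rs.shape)                    # (600, 4) (600, 4)
```

Either reduced set can then be passed to the genetic programming run in place of the full data, which is where the reported speed-up proportional to the data set size would come from.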
Related papers
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods tend to be slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - Post-training Model Quantization Using GANs for Synthetic Data
Generation [57.40733249681334]
We investigate the use of synthetic data as a substitute for the calibration with real data for the quantization method.
We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images.
arXiv Detail & Related papers (2023-05-10T11:10:09Z) - Too Fine or Too Coarse? The Goldilocks Composition of Data Complexity
for Robust Left-Right Eye-Tracking Classifiers [0.0]
We train machine learning models utilizing a mixed dataset composed of both fine- and coarse-grain data.
For our purposes, finer-grain data refers to data collected using more complex methods, whereas coarser-grain data refers to data collected using simpler methods.
arXiv Detail & Related papers (2022-08-24T23:18:08Z) - Gradient-guided Loss Masking for Neural Machine Translation [27.609155878513334]
In this paper, we explore strategies that dynamically optimize data usage during the training process.
Our algorithm calculates the gradient alignment between the training data and the clean data to mask out data with negative alignment.
Experiments on three WMT language pairs show that our method brings significant improvement over strong baselines.
arXiv Detail & Related papers (2021-02-26T15:41:48Z) - Online Missing Value Imputation and Change Point Detection with the
Gaussian Copula [21.26330349034669]
Missing value imputation is crucial for real-world data science.
We develop an online imputation algorithm for mixed data using the Gaussian copula.
arXiv Detail & Related papers (2020-09-25T16:27:47Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.