Minimum variance threshold for epsilon-lexicase selection
- URL: http://arxiv.org/abs/2404.05909v1
- Date: Mon, 8 Apr 2024 23:47:26 GMT
- Title: Minimum variance threshold for epsilon-lexicase selection
- Authors: Guilherme Seidyo Imai Aldeia, Fabricio Olivetti de Franca, William G. La Cava
- Abstract summary: Methods often rely on average error over the entire dataset as a criterion to select the parents.
We propose a new criterion that splits the errors into two partitions that minimize the total variance within partitions.
Our results show a better performance of our approach compared to traditional epsilon-lexicase selection on the real-world datasets.
- Score: 0.7373617024876725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parent selection plays an important role in evolutionary algorithms, and many strategies exist to select the parent pool before breeding the next generation. Methods often rely on average error over the entire dataset as a criterion to select the parents, which can lead to information loss due to the aggregation of all test cases. Under epsilon-lexicase selection, the population goes through a selection pool that is iteratively reduced by using each test case individually, discarding individuals with an error higher than the elite error plus the median absolute deviation (MAD) of errors for that particular test case. In an attempt to better capture differences in performance of individuals on cases, we propose a new criterion that splits errors into two partitions that minimize the total variance within partitions. Our method was embedded into the FEAT symbolic regression algorithm and evaluated with the SRBench framework, which contains 122 black-box synthetic and real-world regression problems. The empirical results show a better performance of our approach compared to traditional epsilon-lexicase selection on the real-world datasets, while showing equivalent performance on the synthetic datasets.
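Below is a minimal sketch of the two thresholding rules described in the abstract, written against a generic population-by-cases error matrix. This is not the FEAT implementation; the function names, the `threshold` switch, and the way the minimum-variance split is turned into an acceptance rule (keep the low-error partition) are one reading of the abstract.

```python
import numpy as np


def mad(x):
    """Median absolute deviation of a 1-D error vector."""
    return np.median(np.abs(x - np.median(x)))


def min_variance_threshold(errors):
    """Sort the case errors and try every two-way split, keeping the one with the
    smallest size-weighted within-partition variance; return the largest error of
    the low-error partition as the acceptance threshold."""
    e = np.sort(np.asarray(errors, dtype=float))
    best_cut, best_cost = 1, np.inf
    for cut in range(1, len(e)):
        low, high = e[:cut], e[cut:]
        cost = len(low) * low.var() + len(high) * high.var()
        if cost < best_cost:
            best_cost, best_cut = cost, cut
    return e[best_cut - 1]


def epsilon_lexicase_select(error_matrix, threshold="mad", rng=None):
    """Select one parent index from error_matrix[i, j] = error of individual i on case j.

    threshold="mad":      keep individuals with error <= elite error + MAD (standard rule).
    threshold="variance": keep the low-error partition of the minimum-variance split.
    """
    rng = np.random.default_rng() if rng is None else rng
    pool = np.arange(error_matrix.shape[0])
    for case in rng.permutation(error_matrix.shape[1]):
        errs = error_matrix[pool, case]
        eps = errs.min() + mad(errs) if threshold == "mad" else min_variance_threshold(errs)
        pool = pool[errs <= eps]  # elite always survives, so the pool never empties
        if len(pool) == 1:
            break
    return int(rng.choice(pool))
```

Calling `epsilon_lexicase_select(errors, threshold="variance")` repeatedly would fill a parent pool; switching the threshold back to `"mad"` recovers the standard epsilon-lexicase rule.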
Related papers
- High-dimensional logistic regression with missing data: Imputation, regularization, and universality [7.167672851569787]
We study high-dimensional, ridge-regularized logistic regression.
We provide exact characterizations of both the prediction error and the estimation error.
arXiv Detail & Related papers (2024-10-01T21:41:21Z) - WHOMP: Optimizing Randomized Controlled Trials via Wasserstein Homogeneity [3.05179671246628]
We introduce a novel partitioning method called the $\textit{Wasserstein Homogeneity Partition}$ (WHOMP).
WHOMP optimally minimizes type I and type II errors that often result from imbalanced group splitting or partitioning.
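As a toy illustration only of the balance objective (not the WHOMP algorithm from the paper, which constructs the partition optimally): compare candidate equal-size splits of a single covariate by the Wasserstein distance between the two arms and keep the most homogeneous split. The function name and the brute-force candidate search below are assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def most_homogeneous_split(covariate, n_candidates=500, rng=None):
    """Brute-force baseline: among random equal-size splits of a 1-D covariate into
    two arms, keep the split whose arms are closest in Wasserstein distance."""
    rng = np.random.default_rng(0) if rng is None else rng
    covariate = np.asarray(covariate, dtype=float)
    n = len(covariate)
    best_split, best_dist = None, np.inf
    for _ in range(n_candidates):
        perm = rng.permutation(n)
        arm_a, arm_b = perm[: n // 2], perm[n // 2:]
        dist = wasserstein_distance(covariate[arm_a], covariate[arm_b])
        if dist < best_dist:
            best_dist, best_split = dist, (arm_a, arm_b)
    return best_split, best_dist
```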
arXiv Detail & Related papers (2024-09-27T07:38:47Z) - Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing [24.553384023323332]
We propose an approach to test for performance disparities based on Conditional Value-at-Risk.
We show that the sample complexity required for discovering performance violations is reduced exponentially and is upper bounded by the square root of the number of groups.
arXiv Detail & Related papers (2023-12-06T19:25:32Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - SUnAA: Sparse Unmixing using Archetypal Analysis [62.997667081978825]
This paper introduces SUnAA, a new sparse unmixing technique based on archetypal analysis.
First, we design a new model based on archetypal analysis.
arXiv Detail & Related papers (2023-08-09T07:58:33Z) - Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows users to differ in both the distribution and the amount of their data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z) - Down-Sampled Epsilon-Lexicase Selection for Real-World Symbolic Regression Problems [1.8047694351309207]
Down-sampled epsilon-lexicase selection combines epsilon-lexicase selection with random subsampling to improve the performance in the domain of symbolic regression.
We find that the diversity is reduced by using down-sampled epsilon-lexicase selection compared to standard epsilon-lexicase selection.
With down-sampled epsilon-lexicase selection we observe an improvement of the solution quality of up to 85% in comparison to standard epsilon-lexicase selection.
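A minimal sketch of the down-sampling step, assuming the `epsilon_lexicase_select` helper from the sketch after the abstract above; the function name and the `subsample_rate` parameter are illustrative, not the paper's implementation.

```python
import numpy as np


def downsampled_epsilon_lexicase(error_matrix, subsample_rate=0.1, rng=None):
    """Draw a random subset of test cases, then run the epsilon-lexicase loop
    (epsilon_lexicase_select from the earlier sketch) on that subset only."""
    rng = np.random.default_rng() if rng is None else rng
    n_cases = error_matrix.shape[1]
    k = max(1, int(subsample_rate * n_cases))
    case_subset = rng.choice(n_cases, size=k, replace=False)
    return epsilon_lexicase_select(error_matrix[:, case_subset], threshold="mad", rng=rng)
```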
arXiv Detail & Related papers (2023-02-08T19:36:26Z) - GMOTE: Gaussian based minority oversampling technique for imbalanced classification adapting tail probability of outliers [0.0]
Data-level approaches mainly use oversampling methods to address the problem, such as the Synthetic Minority Oversampling Technique (SMOTE).
In this paper, we propose a Gaussian-based minority oversampling technique (GMOTE) with a statistical perspective for imbalanced datasets.
When GMOTE is combined with classification and regression trees (CART) or support vector machines (SVM), it shows better accuracy and F1-score.
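For orientation only, a generic Gaussian oversampling sketch: fit a single multivariate normal to the minority class and draw synthetic samples from it. This is not the paper's GMOTE, which additionally adapts the tail probability of outliers; the function name and ridge constant are assumptions.

```python
import numpy as np


def gaussian_oversample(X_minority, n_new, rng=None):
    """Fit one multivariate normal to the minority class and draw n_new synthetic samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    X_minority = np.asarray(X_minority, dtype=float)
    mean = X_minority.mean(axis=0)
    # A small ridge keeps the covariance positive definite for small or collinear samples.
    cov = np.cov(X_minority, rowvar=False) + 1e-6 * np.eye(X_minority.shape[1])
    return rng.multivariate_normal(mean, cov, size=n_new)
```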
arXiv Detail & Related papers (2021-05-09T07:04:37Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - Minimax Active Learning [61.729667575374606]
Active learning aims to develop label-efficient algorithms by querying the most representative samples to be labeled by a human annotator.
Current active learning techniques either rely on model uncertainty to select the most uncertain samples or use clustering or reconstruction to choose the most diverse set of unlabeled examples.
We develop a semi-supervised minimax entropy-based active learning algorithm that leverages both uncertainty and diversity in an adversarial manner.
arXiv Detail & Related papers (2020-12-18T19:03:40Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.