Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm
- URL: http://arxiv.org/abs/2507.15132v1
- Date: Sun, 20 Jul 2025 21:42:30 GMT
- Title: Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm
- Authors: Joanna Komorniczak,
- Abstract summary: This work aims to increase the availability of datasets encompassing a diverse range of problem complexities.<n>For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected.<n>Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The research community continues to seek increasingly more advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.
Related papers
- RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library [58.404895570822184]
RV-Syn is a novel mathematical Synthesis approach.<n>It generates graphs as solutions by combining Python-formatted functions from this library.<n>Based on the constructed graph, we achieve solution-guided logic-aware problem generation.
arXiv Detail & Related papers (2025-04-29T04:42:02Z) - Dataset Properties Shape the Success of Neuroimaging-Based Patient Stratification: A Benchmarking Analysis Across Clustering Algorithms [38.321248253111776]
We evaluated 4 widely used stratification algorithms, HYDRA, SuStaIn, SmileGAN, and SurrealGAN, on a suite of synthetic brain-morphometry cohorts.<n>Across 122 synthetic scenarios, data complexity consistently outweighed algorithm choice in predicting stratification success.<n>Well-separated clusters yielded high accuracy for all methods, whereas overlapping, unequal-sized, or subtle effects reduced accuracy by up to 50%.
arXiv Detail & Related papers (2025-03-15T09:44:00Z) - Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions.
A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA)
We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z) - Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - Personalized Decentralized Multi-Task Learning Over Dynamic
Communication Graphs [59.96266198512243]
We propose a decentralized and federated learning algorithm for tasks that are positively and negatively correlated.
Our algorithm uses gradients to calculate the correlations among tasks automatically, and dynamically adjusts the communication graph to connect mutually beneficial tasks and isolate those that may negatively impact each other.
We conduct experiments on a synthetic Gaussian dataset and a large-scale celebrity attributes (CelebA) dataset.
arXiv Detail & Related papers (2022-12-21T18:58:24Z) - Dataset Complexity Assessment Based on Cumulative Maximum Scaled Area
Under Laplacian Spectrum [38.65823547986758]
It is meaningful to predict classification performance by assessing the complexity of datasets effectively before training DCNN models.
This paper proposes a novel method called cumulative maximum scaled Area Under Laplacian Spectrum (cmsAULS)
arXiv Detail & Related papers (2022-09-29T13:02:04Z) - Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z) - Complexity Measures for Multi-objective Symbolic Regression [2.4087148947930634]
Multi-objective symbolic regression has the advantage that while the accuracy of the learned models is maximized, the complexity is automatically adapted.
We study which complexity measures are most appropriately used in symbolic regression when performing multi- objective optimization with NSGA-II.
arXiv Detail & Related papers (2021-09-01T08:22:41Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_2,p$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - Classifier Pool Generation based on a Two-level Diversity Approach [14.617208698215808]
This paper describes a pool generation method guided by the diversity estimated on the data complexity and decisions.
The complexity measures with high variability across the subsamples are selected for posterior pool adaptation, where an evolutionary algorithm optimize diversity in both complexity and decision spaces.
Results show significant accuracy improvements in 69.4% of the experiments when Dynamic Selection and Dynamic Ensemble Selection methods are applied.
arXiv Detail & Related papers (2020-11-03T18:41:53Z) - Object-Attribute Biclustering for Elimination of Missing Genotypes in
Ischemic Stroke Genome-Wide Data [2.0236506875465863]
Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits.
The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its pattern of uncalled (missing) genotypes.
We use well-developed notions of object-attribute biclusters and formal concepts that correspond to dense subrelations in the binary relation.
arXiv Detail & Related papers (2020-10-22T12:27:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.