Reinforcement Learning-based Feature Generation Algorithm for Scientific Data
- URL: http://arxiv.org/abs/2507.03498v2
- Date: Wed, 09 Jul 2025 11:30:58 GMT
- Title: Reinforcement Learning-based Feature Generation Algorithm for Scientific Data
- Authors: Meng Xiao, Junfeng Zhou, Yuanchun Zhou
- Abstract summary: Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. This paper proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, multiple agents collaboratively construct mathematical transformation equations, synthesize and identify feature combinations exhibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies.
- Score: 6.449769135199048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine-learning model performance. Traditional methods face two challenges when generating features for scientific data. First, effectively constructing high-order feature combinations in scientific data requires deep and extensive domain-specific expertise. Second, as the order of feature combinations increases, the search space expands exponentially, making manual exploration prohibitively labor-intensive. Advancements in the Data-Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation. Inspired by this, the paper revisits the conventional feature generation workflow and proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multiple agents collaboratively construct mathematical transformation equations, synthesize and identify feature combinations exhibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates large language models (LLMs) to interpretively evaluate the features generated at each significant breakthrough in model performance. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.
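The exploration loop described in the abstract can be illustrated with a toy sketch. This is not the MAFG implementation: the operator set `OPS`, the epsilon-greedy `BanditAgent`, the absolute-correlation reward, and every function name below are assumptions chosen for illustration. Three agents pick a head feature, an operation, and a tail feature; the resulting column is scored against the target, and each agent updates its value estimates from the shared reward.

```python
import math
import random

# Hypothetical operator set; the abstract does not list MAFG's operators.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

class BanditAgent:
    """Epsilon-greedy agent keeping a running mean reward per action."""
    def __init__(self, actions, epsilon, rng):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = rng
        self.value = {a: 0.0 for a in self.actions}
        self.count = {a: 0 for a in self.actions}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.value[a])

    def update(self, action, reward):
        # Incremental mean of the rewards observed for this action.
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]

def generate_features(columns, target, steps=200, epsilon=0.2, seed=0):
    """Search for one binary feature combination; returns (expr, score, baseline)."""
    rng = random.Random(seed)
    names = list(columns)
    head = BanditAgent(names, epsilon, rng)
    op = BanditAgent(OPS, epsilon, rng)
    tail = BanditAgent(names, epsilon, rng)
    # Baseline: the best score any original feature achieves on its own.
    baseline = max(abs_corr(columns[n], target) for n in names)
    best_score, best_expr = baseline, None
    for _ in range(steps):
        h, o, t = head.choose(), op.choose(), tail.choose()
        new_col = [OPS[o](a, b) for a, b in zip(columns[h], columns[t])]
        score = abs_corr(new_col, target)
        reward = score - baseline  # shared reward: gain over the baseline
        for agent, action in ((head, h), (op, o), (tail, t)):
            agent.update(action, reward)
        if score > best_score:
            best_score, best_expr = score, (h, o, t)
    return best_expr, best_score, baseline

# Toy data: the target is the product of two features.
_rng = random.Random(1)
x1 = [_rng.uniform(-1, 1) for _ in range(200)]
x2 = [_rng.uniform(-1, 1) for _ in range(200)]
y = [a * b for a, b in zip(x1, x2)]
expr, score, base = generate_features({"x1": x1, "x2": x2}, y)
print(expr, round(score, 3), round(base, 3))
```

On this toy data the product of `x1` and `x2` is the combination that would score highest, provided the agents sample it within the step budget. A fuller treatment would replace the correlation reward with downstream model performance and allow higher-order, nested expressions, as the abstract describes.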
Related papers
- Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning [10.317489871533565]
In this paper, we introduce HRLFS, a reinforcement learning-based subspace exploration strategy for complex datasets.
We show that HRLFS improves the downstream machine learning performance with iterative feature subspace exploration.
We also show that HRLFS accelerates total run time by reducing the number of agents involved.
arXiv Detail & Related papers (2025-04-24T08:16:36Z)
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query.
We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- Generative Fuzzy System for Sequence Generation [16.20988290308979]
We introduce the fuzzy system, a classical modeling method that combines data and knowledge-driven mechanisms, to generative tasks.
We propose an end-to-end GenFS-based model for sequence generation, called FuzzyS2S.
A series of experimental studies were conducted on 12 datasets, covering three distinct categories of generative tasks.
arXiv Detail & Related papers (2024-11-21T06:03:25Z)
- Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method.
By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs.
Our experiments demonstrate that models trained on our data outperform existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z)
- A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance.
We propose a simple yet effective data augmentation approach by leveraging advancements in generative models.
Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
arXiv Detail & Related papers (2024-08-01T07:40:00Z)
- Evolutionary Large Language Model for Automated Feature Transformation [44.64296052383581]
We propose an evolutionary Large Language Model (LLM) framework for automated feature transformation.
This framework consists of two parts: 1) constructing a multi-population database through an RL data collector, and 2) utilizing the ability of Large Language Model (LLM) in sequence understanding.
We empirically demonstrate the effectiveness and generality of our proposed method.
arXiv Detail & Related papers (2024-05-25T12:27:21Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to a significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
- Traceable Automatic Feature Transformation via Cascading Actor-Critic Agents [25.139229855367088]
Feature transformation is an essential task to boost the effectiveness and interpretability of machine learning (ML).
We formulate the feature transformation task as an iterative, nested process of feature generation and selection.
We show 24.7% improvements in F1 scores compared with SOTAs and robustness in high-dimensional data.
arXiv Detail & Related papers (2022-12-27T08:20:19Z)
- Audacity of huge: overcoming challenges of data scarcity and data quality for machine learning in computational materials discovery [1.0036312061637764]
Machine learning (ML)-accelerated discovery requires large amounts of high-fidelity data to reveal predictive structure-property relationships.
For many properties of interest in materials discovery, the challenging nature and high cost of data generation have resulted in a data landscape that is scarcely populated and of dubious quality.
In the absence of manual curation, increasingly sophisticated natural language processing and automated image analysis are making it possible to learn structure-property relationships from the literature.
arXiv Detail & Related papers (2021-11-02T21:43:58Z)