Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation
- URL: http://arxiv.org/abs/2501.10555v1
- Date: Fri, 17 Jan 2025 21:05:09 GMT
- Title: Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation
- Authors: Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, Pengfei Wang, Pengyang Wang, Hui Xiong, Yanjie Fu,
- Abstract summary: This survey examines the key aspects of data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement.
We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns.
- Score: 37.43210238341124
- License:
- Abstract: Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.
Related papers
- A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective [23.25829868360603]
Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing.
This survey explores reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces.
We summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.
arXiv Detail & Related papers (2025-02-12T22:34:50Z) - Deep Learning within Tabular Data: Foundations, Challenges, Advances and Future Directions [4.795774784702568]
Tabular data remains one of the most prevalent data types across a wide range of real-world applications.
Yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies.
arXiv Detail & Related papers (2025-01-07T05:23:36Z) - A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys only focus on a certain type of specific modality data.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z) - Label-Free Topic-Focused Summarization Using Query Augmentation [2.127049691404299]
This study introduces a novel method, Augmented-Query Summarization (AQS), for topic-focused summarization without the need for extensive labelled datasets.
Our method demonstrates the ability to generate relevant and accurate summaries, showing its potential as a cost-effective solution in data-rich environments.
This innovation paves the way for broader application and accessibility in the field of topic-focused summarization technology.
arXiv Detail & Related papers (2024-04-25T08:39:10Z) - Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks.
It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection.
Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
arXiv Detail & Related papers (2024-03-13T16:05:18Z) - Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey [17.19337964440007]
There is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain.
This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized.
It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field.
arXiv Detail & Related papers (2024-02-27T23:59:01Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - Towards an Integrated Platform for Big Data Analysis [4.5257812998381315]
This paper presents the vision of an integrated plat-form for big data analysis that combines all these aspects.
Main benefits of this approach are an enhanced scalability of the whole platform, a better parameterization of algorithms, and an improved usability during the end-to-end data analysis process.
arXiv Detail & Related papers (2020-04-27T03:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.