Data Transformation Strategies to Remove Heterogeneity
- URL: http://arxiv.org/abs/2507.12677v1
- Date: Wed, 16 Jul 2025 23:27:24 GMT
- Title: Data Transformation Strategies to Remove Heterogeneity
- Authors: Sangbong Yoo, Jaeyoung Lee, Chanyoung Yoon, Geonyeong Son, Hyein Hong, Seongbum Seo, Soobin Yim, Chanyoung Jung, Jungsoo Park, Misuk Kim, Yun Jang,
- Abstract summary: Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex.<n>Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation.<n>This survey explores the intricacies of data heterogeneity and its underlying sources.
- Score: 15.025447093605615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is paramount in preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews concerning contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.
Related papers
- Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing.<n>Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest.<n>This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z) - Addressing Heterogeneity in Federated Learning: Challenges and Solutions for a Shared Production Environment [1.2499537119440245]
Federated learning (FL) has emerged as a promising approach to training machine learning models across decentralized data sources.
This paper provides a comprehensive overview of data heterogeneity in FL within the context of manufacturing.
We discuss the impact of these types of heterogeneity on model training and review current methodologies for mitigating their adverse effects.
arXiv Detail & Related papers (2024-08-18T17:49:44Z) - Learning Divergence Fields for Shift-Robust Graph Representations [73.11818515795761]
In this work, we propose a geometric diffusion model with learnable divergence fields for the challenging problem with interdependent data.
We derive a new learning objective through causal inference, which can guide the model to learn generalizable patterns of interdependence that are insensitive across domains.
arXiv Detail & Related papers (2024-06-07T14:29:21Z) - A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.<n>Existing literature surveys only focus on a certain type of specific modality data.<n>We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z) - Machine Learning Techniques for Sensor-based Human Activity Recognition with Data Heterogeneity -- A Review [0.8142555609235358]
Sensor-based Human Activity Recognition (HAR) is crucial in ubiquitous computing.
HAR confronts challenges, particularly in data distribution assumptions.
This review investigates how machine learning addresses data heterogeneity in HAR.
arXiv Detail & Related papers (2024-03-12T22:22:14Z) - Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures are often exploited to improve the solution of learning tasks.
Data augmentation induces these symmetries during training by applying multiple transformations to the input data.
This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
arXiv Detail & Related papers (2022-09-29T18:11:01Z) - Deep invariant networks with differentiable augmentation layers [87.22033101185201]
Methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems.
We show that our approach is easier and faster to train than modern automatic data augmentation techniques.
arXiv Detail & Related papers (2022-02-04T14:12:31Z) - Non-IID data and Continual Learning processes in Federated Learning: A
long road ahead [58.720142291102135]
Federated Learning is a novel framework that allows multiple devices or institutions to train a machine learning model collaboratively while preserving their data private.
In this work, we formally classify data statistical heterogeneity and review the most remarkable learning strategies that are able to face it.
At the same time, we introduce approaches from other machine learning frameworks, such as Continual Learning, that also deal with data heterogeneity and could be easily adapted to the Federated Learning settings.
arXiv Detail & Related papers (2021-11-26T09:57:11Z) - Variational Selective Autoencoder: Learning from Partially-Observed
Heterogeneous Data [45.23338389559936]
We propose the variational selective autoencoder (VSAE) to learn representations from partially-observed heterogeneous data.
VSAE learns the latent dependencies in heterogeneous data by modeling the joint distribution of observed data, unobserved data, and the imputation mask.
It results in a unified model for various downstream tasks including data generation and imputation.
arXiv Detail & Related papers (2021-02-25T04:39:13Z) - Propositionalization and Embeddings: Two Sides of the Same Coin [0.0]
This paper outlines some of the modern data processing techniques used in relational learning.
It focuses on the propositionalization and embedding data transformation approaches.
We present two efficient implementations of the unifying methodology.
arXiv Detail & Related papers (2020-06-08T08:33:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.