Related papers: Handling Incomplete Heterogeneous Data using a Data-Dependent Kernel

Handling Incomplete Heterogeneous Data using a Data-Dependent Kernel

URL: http://arxiv.org/abs/2501.04300v1
Date: Wed, 08 Jan 2025 06:18:32 GMT
Title: Handling Incomplete Heterogeneous Data using a Data-Dependent Kernel
Authors: Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal,
Abstract summary: This paper presents a novel approach to handling missing values using the Mass Similarity Kernel (PMK), a data-dependent kernel.<n>It unifies the representation of diverse data types by capturing more meaningful pairwise similarities.<n>Across both classification and clustering tasks, our approach consistently outperformed existing techniques.
Score: 1.945017258192898
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Handling incomplete data in real-world applications is a critical challenge due to two key limitations of existing methods: (i) they are primarily designed for numeric data and struggle with categorical or heterogeneous/mixed datasets; (ii) they assume that data is missing completely at random, which is often not the case in practice -- in reality, data is missing in patterns, leading to biased results if these patterns are not accounted for. To address these two limitations, this paper presents a novel approach to handling missing values using the Probability Mass Similarity Kernel (PMK), a data-dependent kernel, which does not make any assumptions about data types and missing mechanisms. It eliminates the need for prior knowledge or extensive pre-processing steps and instead leverages the distribution of observed data. Our method unifies the representation of diverse data types by capturing more meaningful pairwise similarities and enhancing downstream performance. We evaluated our approach across over 10 datasets with numerical-only, categorical-only, and mixed features under different missing mechanisms and rates. Across both classification and clustering tasks, our approach consistently outperformed existing techniques, demonstrating its robustness and effectiveness in managing incomplete heterogeneous data.

Related papers

Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining. Experiments on the MNIST dataset for handwritten digit recognition were performed. It was empirically observed that using a ground truth labeled dataset's validation accuracy is inadequate for correcting labels of other previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
DAGnosis: Localized Identification of Data Inconsistencies using Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. We use directed acyclic graphs (DAGs) to encode the training set's features probability distribution and independencies as a structure. Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
arXiv Detail & Related papers (2024-02-26T11:29:16Z)
Binary Quantification and Dataset Shift: An Experimental Investigation [54.14283123210872]
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data. The relationship between quantification and other types of dataset shift remains, by and large, unexplored. We propose a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift.
arXiv Detail & Related papers (2023-10-06T20:11:27Z)
Conditional Feature Importance for Mixed Data [1.6114012813668934]
We develop a conditional predictive impact (CPI) framework with knockoff sampling. We show that our proposed workflow controls type I error, achieves high power and is in line with results given by other conditional FI measures. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
arXiv Detail & Related papers (2022-10-06T16:52:38Z)
Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that not only the issue of data heterogeneity in current setups is not necessarily a problem but also in fact it can be beneficial for the FL participants. Our observations are intuitive. Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
Leachable Component Clustering [10.377914682543903]
In this work, a novel approach to clustering of incomplete data, termed leachable component clustering, is proposed. The proposed method handles data imputation with Bayes alignment, and collects the lost patterns in theory. Experiments on several artificial incomplete data sets demonstrate that, the proposed method is able to present superior performance compared with other state-of-the-art algorithms.
arXiv Detail & Related papers (2022-08-28T13:13:17Z)
MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data. MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism. We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
Greedy structure learning from data that contains systematic missing values [13.088541054366527]
Learning from data that contain missing values represents a common phenomenon in many domains. Relatively few Bayesian Network structure learning algorithms account for missing data. This paper describes three variants of greedy search structure learning that utilise pairwise deletion and inverse probability weighting.
arXiv Detail & Related papers (2021-07-09T02:56:44Z)
Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare. In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples. To tackle this problem, we build a robust one-class classification framework via data refinement. We show that our method outperforms state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
Deep Generative Pattern-Set Mixture Models for Nonignorable Missingness [0.0]
We propose a variational autoencoder architecture to model both ignorable and nonignorable missing data. Our model explicitly learns to cluster the missing data into missingness pattern sets based on the observed data and missingness masks. Our setup trades off the characteristics of ignorable and nonignorable missingness and can thus be applied to data of both types.
arXiv Detail & Related papers (2021-03-05T08:21:35Z)
Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model. The objective is to endow the trained model with robustness against adversarially manipulated input data. Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
Clustering and Classification with Non-Existence Attributes: A Sentenced Discrepancy Measure Based Technique [0.0]
Clustering approaches cannot be applied directly to such data unless pre-processing by techniques like imputation or marginalization. We have overcome this drawback by utilizing a Sentenced Discrepancy Measure which we refer to as the Attribute Weighted Penalty based Discrepancy (AWPD) This technique is designed to trace invaluable data to: directly apply our method on the datasets which have Non-Existence attributes and establish a method for detecting unstructured Non-Existence attributes with the best accuracy rate and minimum cost.
arXiv Detail & Related papers (2020-02-24T17:56:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.