Related papers: Tab-Shapley: Identifying Top-k Tabular Data Quality Insights

URL: http://arxiv.org/abs/2501.06685v1
Date: Sun, 12 Jan 2025 02:24:55 GMT
Title: Tab-Shapley: Identifying Top-k Tabular Data Quality Insights
Authors: Manisha Padala, Lokesh Nagalapatti, Atharv Tyagi, Ramasuri Narayanam, Shiv Kumar Saini
Abstract summary: We introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient.
Score: 7.666573679741346
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.
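Illustration: the abstract notes that computing Shapley values typically requires exponential time. The sketch below shows why, by brute-force enumeration of every coalition of attributes. It is a generic textbook computation and not the paper's closed-form solution; the attribute names and the `toy_value` scoring function are hypothetical and chosen only for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions.

    Cost grows as O(n * 2^(n-1)) evaluations of `value`, which is the
    exponential blow-up the abstract refers to.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical anomaly score (an assumption, not from the paper)."""
    score = 0.0
    if "age" in coalition:
        score += 1.0  # "age" is anomalous on its own
    if "zip" in coalition and "city" in coalition:
        score += 2.0  # "zip" and "city" are anomalous only jointly
    return score


attributes = ["age", "zip", "city"]
phi = shapley_values(attributes, toy_value)
for name in attributes:
    print(name, round(phi[name], 6))
```

In this toy game the jointly anomalous pair splits its score evenly, which is the kind of attribute dependency a simple per-attribute frequency count would miss.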

Related papers

RFOD: Random Forest-based Outlier Detection for Tabular Data [12.469208664014472]
Outlier detection is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare. RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem. RFOD consistently outperforms state-of-the-art baselines in detection accuracy.
arXiv Detail & Related papers (2025-10-09T19:02:12Z)
Deep Context-Conditioned Anomaly Detection for Tabular Data [9.58464841713335]
Anomaly detection is critical in domains such as cybersecurity and finance. In this paper, we present a context-conditional anomaly detection framework. Our approach automatically identifies context features and models the conditional data distribution.
arXiv Detail & Related papers (2025-09-10T22:01:11Z)
Disentangling Tabular Data Towards Better One-Class Anomaly Detection [24.549797910707092]
Tabular anomaly detection under the one-class classification setting poses a significant challenge. Capturing the intrinsic correlation among attributes within normal samples presents one promising method for learning the concept. Our method substantially outperforms the state-of-the-art methods and leads to an average performance improvement of 6.1% on AUC-PR and 2.1% on AUC-ROC.
arXiv Detail & Related papers (2024-11-12T06:24:11Z)
End-to-end guarantees for indirect data-driven control of bilinear systems with finite stochastic data [0.0468732641979009]
We propose an end-to-end algorithm for indirect data-driven control for bilinear systems with stability guarantees. By means of an extensive numerical study we showcase the interplay between the controller design and the derived identification error bounds.
arXiv Detail & Related papers (2024-09-26T16:19:49Z)
AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler [29.395855812763617]
We propose AdapTable, a framework for adapting machine learning models to target data without accessing source data. AdapTable operates in two stages: 1) calibrating model predictions using a shift-aware uncertainty calibrator, and 2) adjusting these predictions to match the target label distribution with a label distribution handler. Our results demonstrate AdapTable's ability to handle various real-world distribution shifts, achieving up to a 16% improvement on the dataset.
arXiv Detail & Related papers (2024-07-15T15:02:53Z)
Anomaly Detection by Context Contrasting [57.695202846009714]
Anomaly detection focuses on identifying samples that deviate from the norm. Recent advances in self-supervised learning have shown great promise in this regard. We propose Con$_2$, which learns through context augmentations.
arXiv Detail & Related papers (2024-05-29T07:59:06Z)
Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies. We show that the likelihood of the available data has no local maxima. We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z)
Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges. We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z)
Beyond Individual Input for Deep Anomaly Detection on Tabular Data [0.0]
Anomaly detection is vital in many domains, such as finance, healthcare, and cybersecurity. To the best of our knowledge, this is the first work to successfully combine feature-feature and sample-sample dependencies. Our method achieves state-of-the-art performance, outperforming existing methods by 2.4% and 1.2% in terms of F1-score and AUROC, respectively.
arXiv Detail & Related papers (2023-05-24T13:13:26Z)
Framing Algorithmic Recourse for Anomaly Detection [18.347886926848563]
We present an approach, Context preserving Algorithmic Recourse for Anomalies in Tabular data (CARAT). CARAT uses a transformer-based encoder-decoder model to explain an anomaly by finding features with low likelihood. Semantically coherent counterfactuals are generated by modifying the highlighted features, using the overall context of features in the anomalous instance(s).
arXiv Detail & Related papers (2022-06-29T03:30:51Z)
SLA$^2$P: Self-supervised Anomaly Detection with Adversarial Perturbation [77.71161225100927]
Anomaly detection is a fundamental yet challenging problem in machine learning. We propose a novel and powerful framework, dubbed SLA$^2$P, for unsupervised anomaly detection.
arXiv Detail & Related papers (2021-11-25T03:53:43Z)
Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data [150.9270911031327]
We consider the problem of anomaly detection with a small set of partially labeled anomaly examples and a large-scale unlabeled dataset. Existing related methods either exclusively fit the limited anomaly examples that typically do not span the entire set of anomalies, or proceed with unsupervised learning from the unlabeled data. We propose here instead a deep reinforcement learning-based approach that enables an end-to-end optimization of the detection of both labeled and unlabeled anomalies.
arXiv Detail & Related papers (2020-09-15T03:05:39Z)
Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets. Part of the challenge of learning robust models lies in the influence of unobserved confounders. We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z)
Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management. We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.