Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking
- URL: http://arxiv.org/abs/2504.20900v1
- Date: Tue, 29 Apr 2025 16:16:51 GMT
- Title: Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking
- Authors: Dayananda Herurkar, Ahmad Ali, Andreas Dengel
- Abstract summary: Existing evaluation metrics offer only partial insights, lacking a comprehensive measure of generative performance. We propose three novel evaluation metrics: FAED, FPCAD, and RFIS. Our results demonstrate that FAED effectively captures generative modeling issues overlooked by existing metrics.
- Score: 11.03600500716845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have revolutionized multiple domains, yet their application to tabular data remains underexplored. Evaluating generative models for tabular data presents unique challenges due to structural complexity, large-scale variability, and mixed data types, making it difficult to intuitively capture intricate patterns. Existing evaluation metrics offer only partial insights, lacking a comprehensive measure of generative performance. To address this limitation, we propose three novel evaluation metrics: FAED, FPCAD, and RFIS. Our extensive experimental analysis, conducted on three standard network intrusion detection datasets, compares these metrics with established evaluation methods such as Fidelity, Utility, TSTR, and TRTS. Our results demonstrate that FAED effectively captures generative modeling issues overlooked by existing metrics. While FPCAD exhibits promising performance, further refinements are necessary to enhance its reliability. Our proposed framework provides a robust and practical approach for assessing generative models in tabular data applications.
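The abstract names the proposed and baseline metrics but not their formulas. As a hedged illustration: by analogy with FID, FAED and FPCAD are plausibly Fréchet-style distances computed in a learned embedding space (an autoencoder bottleneck for FAED, a PCA projection for FPCAD), while TSTR/TRTS denote the standard Train-on-Synthetic, Test-on-Real protocol and its reverse. The Python sketch below illustrates both templates under those assumptions; the function names and hyperparameters are illustrative, not the authors' reference implementation.

```python
import numpy as np
from scipy import linalg
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def frechet_distance(emb_real: np.ndarray, emb_synth: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets:
    FD = ||mu_r - mu_s||^2 + Tr(C_r + C_s - 2 (C_r C_s)^(1/2)).
    FAED/FPCAD presumably apply this to autoencoder/PCA embeddings
    of tabular records (an assumption, not the paper's definition)."""
    mu_r, mu_s = emb_real.mean(axis=0), emb_synth.mean(axis=0)
    c_r = np.cov(emb_real, rowvar=False)
    c_s = np.cov(emb_synth, rowvar=False)
    # Taking the real part discards numerical noise that sqrtm can
    # introduce when the covariance product is near-singular.
    covmean = linalg.sqrtm(c_r @ c_s).real
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(c_r + c_s - 2.0 * covmean))

def tstr_auroc(X_synth, y_synth, X_real_test, y_real_test) -> float:
    """Train-on-Synthetic, Test-on-Real (TSTR) for a binary task:
    fit a downstream classifier on synthetic data, score on held-out
    real data. TRTS is the same protocol with the datasets swapped."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_synth, y_synth)
    return roc_auc_score(y_real_test, clf.predict_proba(X_real_test)[:, 1])
```

Under this reading, a lower Fréchet-style distance and a TSTR score close to the real-data baseline would both point to higher-quality synthetic tabular data.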
Related papers
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - A Multi-Armed Bandit Approach to Online Selection and Evaluation of Generative Models [23.91197677628145]
In this work, we propose an online evaluation and selection framework to find the generative model that maximizes a standard assessment score. Specifically, we develop the MAB-based selection of generative models considering the Fréchet Distance (FD) and Inception Score (IS) metrics. Our empirical results suggest the efficacy of MAB approaches for the sample-efficient evaluation and selection of deep generative models.
arXiv Detail & Related papers (2024-06-11T16:57:48Z) - Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model [34.1224836768324]
FEVEROUS is a benchmark and research initiative focused on fact extraction and verification tasks.
This paper introduces a simple yet powerful model that nullifies the need for modality conversion.
Our approach efficiently exploits latent connections between different data types, thereby yielding comprehensive and reliable verdict predictions.
arXiv Detail & Related papers (2024-03-26T03:54:25Z) - Retrieval Augmented Deep Anomaly Detection for Tabular Data [0.0]
Research has introduced retrieval-augmented models to address this gap.
We propose a reconstruction-based approach in which a transformer model learns to reconstruct masked features of normal samples.
Experiments on a benchmark of 31 datasets reveal that augmenting this reconstruction-based anomaly detection method with sample-sample dependencies via retrieval modules significantly boosts performance.
arXiv Detail & Related papers (2024-01-30T14:33:18Z) - Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data [75.20035991513564]
We introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation.
Our experiments demonstrate that 3S Testing outperforms traditional baselines.
These results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.
arXiv Detail & Related papers (2023-10-25T10:18:44Z) - Quantifying Overfitting: Introducing the Overfitting Index [0.0]
Overfitting occurs when a model exhibits superior performance on training data but falters on unseen data.
This paper introduces the Overfitting Index (OI), a novel metric devised to quantitatively assess a model's tendency to overfit.
Our results underscore the variable overfitting behaviors across architectures and highlight the mitigative impact of data augmentation.
arXiv Detail & Related papers (2023-08-16T21:32:57Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - Geometric Deep Learning for Structure-Based Drug Design: A Survey [83.87489798671155]
Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates.
Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, have significantly propelled the field forward.
arXiv Detail & Related papers (2023-06-20T14:21:58Z) - Beyond Individual Input for Deep Anomaly Detection on Tabular Data [0.0]
Anomaly detection is vital in many domains, such as finance, healthcare, and cybersecurity.
To the best of our knowledge, this is the first work to successfully combine feature-feature and sample-sample dependencies.
Our method achieves state-of-the-art performance, outperforming existing methods by 2.4% and 1.2% in terms of F1-score and AUROC, respectively.
arXiv Detail & Related papers (2023-05-24T13:13:26Z) - Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples [25.657798631897908]
Feature Likelihood Divergence provides a comprehensive trichotomic evaluation of generative models.
We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail.
arXiv Detail & Related papers (2023-02-09T04:57:27Z) - Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for NLP classification tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z) - Predicting Multidimensional Data via Tensor Learning [0.0]
We develop a model that retains the intrinsic multidimensional structure of the dataset.
To estimate the model parameters, an Alternating Least Squares algorithm is developed.
The proposed model is able to outperform benchmark models present in the forecasting literature.
arXiv Detail & Related papers (2020-02-11T11:57:07Z)