Related papers: How Well Does Your Tabular Generator Learn the Structure of Tabular Data?

URL: http://arxiv.org/abs/2503.09453v1
Date: Wed, 12 Mar 2025 14:54:58 GMT
Title: How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
Authors: Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Abstract summary: In this paper, we introduce TabStruct, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. We show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension.
Score: 10.974400005358193
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Heterogeneous tabular data poses unique challenges in generative modelling due to its fundamentally different underlying data structure compared to homogeneous modalities, such as images and text. Although previous research has sought to adapt the successes of generative modelling in homogeneous modalities to the tabular domain, defining an effective generator for tabular data remains an open problem. One major reason is that the evaluation criteria inherited from other modalities often fail to adequately assess whether tabular generative models effectively capture or utilise the unique structural information encoded in tabular data. In this paper, we carefully examine the limitations of the prevailing evaluation framework and introduce $\textbf{TabStruct}$, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. Specifically, TabStruct evaluates the alignment of causal structures in real and synthetic data, providing a direct measure of how effectively tabular generative models learn the structure of tabular data. Through extensive experiments using generators from eight categories on seven datasets with expert-validated causal graphical structures, we show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension. Our findings highlight the importance of tabular data structure and offer practical guidance for developing more effective and robust tabular generative models. Code is available at https://github.com/SilenceX12138/TabStruct.

Related papers

Improving Deep Tabular Learning [1.2891210250935148]
Tabular data remains a dominant form of real-world information but poses persistent challenges for deep learning. In this work, we introduce RuleNet, a transformer-based architecture specifically designed for deep tabular learning.
arXiv Detail & Related papers (2025-09-19T18:51:14Z)
TabStruct: Measuring Structural Fidelity of Tabular Data [28.606994119562163]
We introduce a new evaluation metric, $\textbf{global utility}$, which enables the assessment of structural fidelity even in the absence of ground-truth causal structures. We also present the TabStruct benchmark suite, including all datasets, evaluation pipelines, and raw results.
arXiv Detail & Related papers (2025-09-15T14:08:20Z)
Generating Synthetic Relational Tabular Data via Structural Causal Models [0.0]
We develop a novel framework that generates realistic synthetic relational data including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.
arXiv Detail & Related papers (2025-07-04T12:27:23Z)
AlphaFold Database Debiasing for Robust Inverse Folding [58.792020809180336]
We introduce a Debiasing Structure AutoEncoder (DeSAE) that learns to reconstruct native-like conformations from intentionally corrupted backbone geometries. At inference, applying DeSAE to AFDB structures produces debiased structures that significantly improve inverse folding performance.
arXiv Detail & Related papers (2025-06-10T02:25:31Z)
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.
arXiv Detail & Related papers (2025-02-06T13:13:26Z)
Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data [1.0621665950143144]
This paper proposes the Theme-Explanation Structure-based Table Summarization pipeline (Tabular-TX). It generates summary sentences following a structured format, where the Theme Part appears as an adverbial phrase, and the Explanation Part follows as a predictive clause. Experimental results demonstrate that Tabular-TX significantly outperforms conventional fine-tuning-based methods.
arXiv Detail & Related papers (2025-01-17T08:42:49Z)
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer [14.330758748478281]
TabTreeFormer is a hybrid transformer architecture that integrates inductive biases of tree-based models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency.
arXiv Detail & Related papers (2025-01-02T11:57:08Z)
A Closer Look at Deep Learning Methods on Tabular Datasets [52.50778536274327]
Tabular data is prevalent across diverse domains in machine learning. Deep Neural Network (DNN)-based methods have recently demonstrated promising performance. We compare 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria.
arXiv Detail & Related papers (2024-07-01T04:24:07Z)
LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets. LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets. We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework [18.11940247961923]
In this paper, we introduce high-order structural causal information as natural prior knowledge. We propose multiple benchmark tasks, high-order metrics, and causal inference tasks as downstream tasks for evaluating the quality of synthetic data.
arXiv Detail & Related papers (2024-06-12T15:12:49Z)
Unifying Structured Data as Graph for Data-to-Text Pre-Training [69.96195162337793]
Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proved to be powerful in enhancing D2T generation. We propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer.
arXiv Detail & Related papers (2024-01-02T12:23:49Z)
Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM). The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels. Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
Leveraging Data Recasting to Enhance Tabular Reasoning [21.970920861791015]
Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is difficult to scale. The second category for creation is synthetic generation, which is scalable and cost effective but lacks inventiveness.
arXiv Detail & Related papers (2022-11-23T00:04:57Z)
Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test. We train a variational inference model to predict the causal structure from observational/interventional data. Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
Table Structure Recognition with Conditional Attention [13.976736586808308]
The Table Structure Recognition (TSR) problem aims to recognize the structure of a table and transform unstructured tables into a structured, machine-readable format. In this study, we hypothesize that a complicated table structure can be represented by a graph whose vertices and edges represent the cells and the associations between cells, respectively. Experimental results show that the alignment of a cell bounding box can help improve the Micro-averaged F1 score from 0.915 to 0.963, and the Macro-averaged F1 score from 0.787 to 0.923.
arXiv Detail & Related papers (2022-03-08T02:44:58Z)
SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab). We argue that reconstructing the data from the subset of its features rather than its corrupted version in an autoencoder setting can better capture its underlying representation.
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
Adaptive Attribute and Structure Subspace Clustering Network [49.040136530379094]
We propose a novel self-expressiveness-based subspace clustering network. We first consider an auto-encoder to represent input data samples. Then, we construct a mixed signed and symmetric structure matrix to capture the local geometric structure underlying the data. We perform self-expressiveness on the constructed attribute and structure matrices to learn their affinity graphs.
arXiv Detail & Related papers (2021-09-28T14:00:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.