Related papers: 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs

4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs

URL: http://arxiv.org/abs/2404.18209v1
Date: Sun, 28 Apr 2024 15:04:54 GMT
Title: 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs
Authors: Minjie Wang, Quan Gan, David Wipf, Zhenkun Cai, Ning Li, Jianheng Tang, Yanlin Zhang, Zizhao Zhang, Zunyao Mao, Yakun Song, Yanbo Wang, Jiahang Li, Han Zhang, Guang Yang, Xiao Qin, Chuan Lei, Muhan Zhang, Weinan Zhang, Christos Faloutsos, Zheng Zhang,
Abstract summary: RDBs store vast amounts of rich, informative data spread across interconnected tables. The progress of predictive machine learning models falls behind advances in other domains such as computer vision or natural language processing. We explore a class of baseline models predicated on converting multi-table datasets into graphs. We assemble a diverse collection of large-scale RDB datasets and (ii) coincident predictive tasks.
Score: 67.47600679176963
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although RDBs store vast amounts of rich, informative data spread across interconnected tables, the progress of predictive machine learning models as applied to such tasks arguably falls well behind advances in other domains such as computer vision or natural language processing. This deficit stems, at least in part, from the lack of established/public RDB benchmarks as needed for training and evaluation purposes. As a result, related model development thus far often defaults to tabular approaches trained on ubiquitous single-table benchmarks, or on the relational side, graph-based alternatives such as GNNs applied to a completely different set of graph datasets devoid of tabular characteristics. To more precisely target RDBs lying at the nexus of these two complementary regimes, we explore a broad class of baseline models predicated on: (i) converting multi-table datasets into graphs using various strategies equipped with efficient subsampling, while preserving tabular characteristics; and (ii) trainable models with well-matched inductive biases that output predictions based on these input subgraphs. Then, to address the dearth of suitable public benchmarks and reduce siloed comparisons, we assemble a diverse collection of (i) large-scale RDB datasets and (ii) coincident predictive tasks. From a delivery standpoint, we operationalize the above four dimensions (4D) of exploration within a unified, scalable open-source toolbox called 4DBInfer. We conclude by presenting evaluations using 4DBInfer, the results of which highlight the importance of considering each such dimension in the design of RDB predictive models, as well as the limitations of more naive approaches such as simply joining adjacent tables. Our source code is released at https://github.com/awslabs/multi-table-benchmark .

Related papers

Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases [3.6423651166048874]
Flattening the database poses challenges for deep learning models.<n>We propose a novel hypergraph-based framework, that we call rel-HNN.<n>We show that rel-HNN significantly outperforms existing methods in both classification and regression tasks.
arXiv Detail & Related papers (2025-07-16T18:20:45Z)
RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases [23.836665904554426]
RDB-to-graph modeling helps capture cross-table dependencies, leading to enhanced performance across diverse tasks.<n>Applying a common rule for graph modeling leads to a 10% drop in performance compared to the best-performing graph model.<n>We introduce RDB2G, the first benchmark framework for evaluating such methods.
arXiv Detail & Related papers (2025-06-02T06:34:10Z)
Joint Relational Database Generation via Graph-Conditional Diffusion Models [44.06390394789874]
Building generative models for databases (RDBs) is important for applications like privacy's data release and real datasets.<n>Most prior either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially.<n>We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order.
arXiv Detail & Related papers (2025-05-22T11:12:56Z)
Griffin: Towards a Graph-Centric Relational Database Foundation Model [37.09648739513178]
Griffin is the first foundation model attemptation designed specifically for Databases (RDBs)<n>We enhance the architecture by incorporating a cross-attention module and a novel aggregator.<n>Griffin is evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains.
arXiv Detail & Related papers (2025-05-08T18:03:43Z)
Large Language Models as Interpolated and Extrapolated Event Predictors [10.32127659470566]
We investigate how large language models (LLMs) streamline the design of event prediction frameworks. We propose LEAP, a unified framework that leverages large language models as event predictors.
arXiv Detail & Related papers (2024-06-15T04:09:31Z)
Novel Representation Learning Technique using Graphs for Performance Analytics [0.0]
We propose a novel idea of transforming performance data into graphs to leverage the advancement of Graph Neural Network-based (GNN) techniques. In contrast to other Machine Learning application domains, such as social networks, the graph is not given; instead, we need to build it. We evaluate the effectiveness of the generated embeddings from GNNs based on how well they make even a simple feed-forward neural network perform for regression tasks.
arXiv Detail & Related papers (2024-01-19T16:34:37Z)
D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset. We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis [50.972595036856035]
We present a code that successfully replicates results from six popular and recent graph recommendation models. We compare these graph models with traditional collaborative filtering models that historically performed well in offline evaluations. By investigating the information flow from users' neighborhoods, we aim to identify which models are influenced by intrinsic features in the dataset structure.
arXiv Detail & Related papers (2023-08-01T09:31:44Z)
TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023 [33.70333110327871]
We present TabR -- essentially, a feed-forward network with a custom k-Nearest-Neighbors-like component in the middle. On a set of public benchmarks with datasets up to several million objects, TabR demonstrates the best average performance. In addition to the much higher performance, TabR is simple and significantly more efficient.
arXiv Detail & Related papers (2023-07-26T17:58:07Z)
Leveraging Data Recasting to Enhance Tabular Reasoning [21.970920861791015]
Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is difficult to scale. The second category for creation is synthetic generation, which is scalable and cost effective but lacks inventiveness.
arXiv Detail & Related papers (2022-11-23T00:04:57Z)
TabGNN: Multiplex Graph Neural Network for Tabular Data Prediction [43.35301059378836]
We propose a novel framework TabGNN based on recently popular graph neural networks (GNN) Specifically, we firstly construct a multiplex graph to model the multifaceted sample relations, and then design a multiplex graph neural network to learn enhanced representation for each sample. Experiments on eleven TDP datasets from various domains, including classification and regression ones, show that TabGNN can consistently improve the performance.
arXiv Detail & Related papers (2021-08-20T11:51:32Z)
ARM-Net: Adaptive Relation Modeling Network for Structured Data [29.94433633729326]
ARM-Net is an adaptive relation modeling network tailored for structured data and a lightweight framework ARMOR based on ARM-Net for relational data. We show that ARM-Net consistently outperforms existing models and provides more interpretable predictions for datasets.
arXiv Detail & Related papers (2021-07-05T07:37:24Z)
BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift. We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction. We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data. Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.