WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning
- URL: http://arxiv.org/abs/2505.16635v1
- Date: Thu, 22 May 2025 13:07:06 GMT
- Title: WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning
- Authors: Zhaomin Wu, Ziyang Wang, Bingsheng He
- Abstract summary: We introduce WikiDBGraph, a large-scale graph of 100,000 real-world databases from WikiData. It identifies both instance- and feature-overlapped databases. Experiments on these newly identified databases confirm that collaborative learning yields superior performance.
- Score: 33.80292133537436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data, ubiquitous and rich in informational value, is an increasing focus for deep representation learning, yet progress is hindered by studies centered on single tables or isolated databases, which limits model capabilities due to data scale. While collaborative learning approaches such as federated learning, transfer learning, split learning, and tabular foundation models aim to learn from multiple correlated databases, they are challenged by a scarcity of real-world interconnected tabular resources. Current data lakes and corpora largely consist of isolated databases lacking defined inter-database correlations. To overcome this, we introduce WikiDBGraph, a large-scale graph of 100,000 real-world tabular databases from WikiData, interconnected by 17 million edges and characterized by 13 node and 12 edge properties derived from its database schema and data distribution. WikiDBGraph's weighted edges identify both instance- and feature-overlapped databases. Experiments on these newly identified databases confirm that collaborative learning yields superior performance, thereby offering considerable promise for structured foundation model training while also exposing key challenges and future directions for learning from interconnected tabular data.
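The abstract says that weighted edges mark feature-overlapped databases but does not describe how the weights are computed. As a rough illustration only, the sketch below links databases whose column names overlap, using Jaccard similarity as a stand-in weight. The function names, the `threshold` parameter, and the toy schemas are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_overlap_graph(schemas, threshold=0.5):
    """Return {(db_i, db_j): weight} for database pairs whose column-name
    overlap is at least `threshold`.

    Assumption: column-name overlap is used here as a simple proxy for
    feature overlap; the paper's actual edge definition may differ.
    """
    edges = {}
    # Compare every unordered pair of databases once, in a stable order.
    for (name_a, cols_a), (name_b, cols_b) in combinations(sorted(schemas.items()), 2):
        weight = jaccard(set(cols_a), set(cols_b))
        if weight >= threshold:
            edges[(name_a, name_b)] = weight
    return edges


# Toy, made-up schemas (not real WikiDBGraph databases).
schemas = {
    "db_a": ["id", "name", "country"],
    "db_b": ["id", "name", "population"],
    "db_c": ["title", "year"],
}
edges = build_overlap_graph(schemas, threshold=0.5)
```

With these toy schemas only `db_a` and `db_b` are linked, since they share two of four distinct column names. The pairwise loop is quadratic in the number of databases, so a graph at the scale described in the abstract would need a more scalable candidate-generation step.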
Related papers
- Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures [50.46688111973999]
Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data. We present a new blueprint that enables end-to-end representation of 'relational entity graphs' without traditional feature engineering. We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data.
arXiv Detail & Related papers (2025-06-19T23:51:38Z) - RelGNN: Composite Message Passing for Relational Deep Learning [56.48834369525997]
We introduce RelGNN, a novel GNN framework specifically designed to leverage the unique structural characteristics of the graphs built from relational databases. RelGNN is evaluated on 30 diverse real-world tasks from RelBench (Fey et al., 2024), and achieves state-of-the-art performance on the vast majority of tasks, with improvements of up to 25%.
arXiv Detail & Related papers (2025-02-10T18:58:40Z) - Federated Neural Graph Databases [53.03085605769093]
We propose Federated Neural Graph Database (FedNGDB), a novel framework that enables reasoning over multi-source graph-based data while preserving privacy.
Unlike existing methods, FedNGDB can handle complex graph structures and relationships, making it suitable for various downstream tasks.
arXiv Detail & Related papers (2024-02-22T14:57:44Z) - Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data.
arXiv Detail & Related papers (2023-12-07T18:51:41Z) - GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [39.975491511390985]
We propose a novel framework called Graph-based Feature Synthesis (GFS).
GFS formulates the relational database as a heterogeneous graph database.
In an experiment over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases.
arXiv Detail & Related papers (2023-12-04T16:54:40Z) - Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges.
We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z) - A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z) - Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z) - On Embeddings in Relational Databases [11.52782249184251]
We address the problem of learning a distributed representation of entities in a relational database using a low-dimensional embedding.
Recent methods for learning embeddings take the naive approach of completely denormalizing the database by taking the full join of all tables and representing the result as a knowledge graph.
In this paper we demonstrate a better methodology for learning representations by exploiting the underlying semantics of columns in a table while using the relation joins and the latent inter-row relationships.
arXiv Detail & Related papers (2020-05-13T17:21:27Z) - Siamese Graph Neural Networks for Data Integration [11.41207739004894]
We propose a general approach to modeling and integrating entities from structured data, such as relational databases, as well as unstructured sources, such as free text from news articles.
Our approach is designed to explicitly model and leverage relations between entities, thereby using all available information and preserving as much context as possible.
We evaluate our method on the task of integrating data about business entities, and we demonstrate that it outperforms standard rule-based systems, as well as other deep learning approaches that do not use graph-based representations.
arXiv Detail & Related papers (2020-01-17T21:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.