Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
- URL: http://arxiv.org/abs/2510.25542v1
- Date: Wed, 29 Oct 2025 14:07:12 GMT
- Title: Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
- Authors: Yuan Cheng, Yu Huang, Zhe Xiong, Yingbin Liang, Vincent Y. F. Tan
- Abstract summary: Transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. We introduce a novel information-theoretic metric, the kernel-guided mutual information (KG-MI), based on the $f$-divergence. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time.
- Score: 91.66597637613263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
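The abstract describes the method only at a high level, so what follows is a minimal, hedged sketch of the setting rather than the paper's algorithm: it samples sequences from a random $K$-parent DAG, trains one layer of $K$-head attention (with directly trainable score logits, a common simplification in this theory line) by gradient ascent on a Gaussian log-likelihood surrogate, then checks whether attention mass lands on the true parents. The surrogate stands in for the KG-MI objective, whose exact form the abstract does not give; recall only that $f$-mutual information is $I_f(X;Y) = D_f(P_{XY}\,\|\,P_X \otimes P_Y)$, which reduces to Shannon mutual information under the KL choice $f(t)=t\log t$. All names here (`parents`, `logits`, the mean-of-parents data generator) are illustrative assumptions.

```python
# Hedged toy sketch (NOT the paper's KG-MI algorithm): sequences from a
# K-parent DAG, a single-layer multi-head attention with trainable score
# logits, and gradient ascent on a Gaussian log-likelihood surrogate.
import torch

torch.manual_seed(0)
L, K, N = 8, 2, 4096  # sequence length, parents per node, number of samples

# Random K-parent DAG over the topological order 0..L-1 (assumed setup).
parents = {i: torch.randperm(i)[: min(K, i)] for i in range(1, L)}
adj = torch.zeros(L, L)
for i, ps in parents.items():
    adj[i, ps] = 1.0

# Data: each node is the mean of its parents plus Gaussian noise (assumption).
X = torch.zeros(N, L)
X[:, 0] = torch.randn(N)
for i in range(1, L):
    X[:, i] = X[:, parents[i]].mean(dim=1) + 0.1 * torch.randn(N)

# Single-layer, K-head attention: one trainable score matrix per head,
# causally masked so node i may only attend to earlier nodes j < i.
logits = torch.zeros(K, L, L, requires_grad=True)
causal = torch.tril(torch.ones(L, L), diagonal=-1)

opt = torch.optim.SGD([logits], lr=5.0)
for step in range(2000):
    attn = torch.softmax(logits.masked_fill(causal == 0, -1e9), dim=-1)
    pred = torch.einsum('klj,nj->nkl', attn, X).mean(dim=1)  # average heads
    obj = -((pred[:, 1:] - X[:, 1:]) ** 2).mean()  # Gaussian log-lik surrogate
    opt.zero_grad()
    (-obj).backward()  # SGD minimizes, so negate to perform gradient *ascent*
    opt.step()

# Structure read-out: threshold the head-summed attention scores.
with torch.no_grad():
    attn = torch.softmax(logits.masked_fill(causal == 0, -1e9), dim=-1)
    recovered = (attn.sum(dim=0) > 0.25).float() * causal
print("true adjacency:\n", adj)
print("attention-implied adjacency:\n", recovered)
```

Because all heads are trained on one pooled loss in this toy, they need not specialize to distinct parents; the union of their attention peaks still tends to recover the adjacency pattern here. Enforcing per-head specialization via distinct marginal transition kernels is exactly the role the abstract assigns to KG-MI.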
Related papers
- OWLEYE: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection [48.77471686671269]
OWLEYE is a novel framework that learns transferable patterns of normal behavior from multiple graphs. We show that OWLEYE achieves superior performance and generalizability compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-01-27T02:08:18Z) - Modality as Heterogeneity: Node Splitting and Graph Rewiring for Multimodal Graph Learning [10.65673380743972]
We propose NSG (Node Splitting Graph)-MoE, a multimodal graph learning framework that integrates a node-splitting and graph-rewiring mechanism. It explicitly decomposes each node into modality-specific components and assigns relation-aware experts to process heterogeneous message flows. Experiments on three multimodal benchmarks demonstrate that NSG-MoE consistently surpasses strong baselines.
arXiv Detail & Related papers (2026-01-20T13:38:50Z) - Self-Reinforced Graph Contrastive Learning [7.49025068464945]
We propose SRGCL (Self-Reinforced Graph Contrastive Learning), a novel framework to dynamically evaluate and select high-quality positive pairs. In experiments on diverse graph-level classification tasks, SRGCL consistently outperforms state-of-the-art GCL methods.
arXiv Detail & Related papers (2025-05-19T18:45:54Z) - Balanced Multi-Relational Graph Clustering [5.531383184058319]
Multi-relational graph clustering has demonstrated remarkable success in uncovering underlying patterns in complex networks.
Our empirical study finds the pervasive presence of imbalance in real-world graphs, which is in principle contradictory to the motivation of alignment.
We propose Balanced Multi-Relational Graph Clustering (BMGC), comprising unsupervised dominant view mining and dual signals guided representation learning.
arXiv Detail & Related papers (2024-07-23T22:11:13Z) - Introducing Diminutive Causal Structure into Graph Representation Learning [19.132025125620274]
We introduce a novel method that enables Graph Neural Networks (GNNs) to glean insights from specialized diminutive causal structures.
Our method specifically extracts causal knowledge from the model representation of these diminutive causal structures.
arXiv Detail & Related papers (2024-06-13T00:18:20Z) - What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z) - Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation. Experiments on a pre-trained CLIP model show that it embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z) - Directed Acyclic Graph Structure Learning from Dynamic Graphs [44.21230819336437]
Estimating the structure of directed acyclic graphs (DAGs) of features (variables) plays a vital role in revealing the latent data generation process.
We study the learning problem of node feature generation mechanism on such ubiquitous dynamic graph data.
arXiv Detail & Related papers (2022-11-30T14:22:01Z) - Discovering the Representation Bottleneck of Graph Neural Networks from Multi-order Interactions [51.597480162777074]
Graph neural networks (GNNs) rely on the message passing paradigm to propagate node features and build interactions.
Recent works point out that different graph learning tasks require different ranges of interactions between nodes.
We study two common graph construction methods in scientific domains, i.e., K-nearest neighbor (KNN) graphs and fully-connected (FC) graphs.
arXiv Detail & Related papers (2022-05-15T11:38:14Z) - Graph Representation Learning via Graphical Mutual Information Maximization [86.32278001019854]
We propose a novel concept, Graphical Mutual Information (GMI), to measure the correlation between input graphs and high-level hidden representations.
We develop an unsupervised learning model trained by maximizing GMI between the input and output of a graph neural encoder.
arXiv Detail & Related papers (2020-02-04T08:33:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.