Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework
- URL: http://arxiv.org/abs/2510.18825v1
- Date: Tue, 21 Oct 2025 17:22:32 GMT
- Title: Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework
- Authors: Yujie Xing, Xiao Wang, Bin Wu, Hai Huang, Chuan Shi
- Abstract summary: We propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. We introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation.
- Score: 18.725415922303632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
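The abstract's dual attention computation scheme, which switches between dense and sparse modes based on mask sparsity, can be sketched roughly as below. The function and parameter names (`masked_attention`, `sparsity_threshold`) are illustrative, not the authors' released code, and the sketch assumes each mask row permits at least one key (e.g. a self-loop) so the softmax is well defined.

```python
import numpy as np

def masked_attention(Q, K, V, mask, sparsity_threshold=0.9):
    """Attend only where mask[i, j] == 1, switching compute modes.

    When the mask is mostly ones, a dense softmax over the full score
    matrix is cheapest; when it is mostly zeros, gathering only the
    allowed keys per query avoids wasted work. This mirrors the
    dense/sparse switching idea described in the abstract.
    """
    n, d = Q.shape
    sparsity = 1.0 - mask.mean()  # fraction of disallowed pairs
    if sparsity < sparsity_threshold:
        # Dense mode: score all pairs, then mask out forbidden ones.
        scores = Q @ K.T / np.sqrt(d)
        scores = np.where(mask.astype(bool), scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ V
    # Sparse mode: per query, gather only the allowed keys.
    out = np.zeros_like(V)
    for i in range(n):
        idx = np.nonzero(mask[i])[0]
        s = Q[i] @ K[idx].T / np.sqrt(d)
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out
```

Both branches compute the same masked softmax, so the switch is purely a performance decision; in a real implementation the sparse branch would be vectorized over a block-sparse layout rather than looped per row.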
Related papers
- ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models [70.28556518166037]
We introduce ViewMask-1-to-3, a pioneering approach that applies discrete diffusion models to multi-view image generation. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints. Our approach ranks first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS.
arXiv Detail & Related papers (2025-12-16T05:15:07Z)
- Trainable Dynamic Mask Sparse Attention [11.506985057671015]
We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches. We demonstrate that the introduced dynamic mask and sparse weights do not obstruct gradients, supporting end-to-end training.
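A common way to realize a content-aware dynamic mask, sketched here as an illustration rather than this paper's actual method, is to keep the top-k attention scores per query; `dynamic_topk_mask` and its signature are hypothetical names for this sketch.

```python
import numpy as np

def dynamic_topk_mask(scores, k):
    """Build a content-aware sparse mask from an attention score matrix.

    For each query (row), keep only the k highest-scoring keys. The mask
    therefore depends on the content (the scores) rather than on fixed
    positions, which is the gist of a "dynamic" mask.
    """
    n = scores.shape[0]
    mask = np.zeros_like(scores)
    # argpartition places the k largest entries of each row at the end.
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    mask[np.arange(n)[:, None], idx] = 1.0
    return mask
```

In a differentiable setting the hard selection would typically be paired with a straight-through or relaxed estimator so gradients can flow, as the blurb above suggests the paper's mechanism supports.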
arXiv Detail & Related papers (2025-08-04T07:05:15Z)
- Polyline Path Masked Attention for Vision Transformer [52.90241449955985]
Vision Transformers (ViTs) have achieved remarkable success in computer vision. Mamba2 has demonstrated its significant potential in natural language processing tasks. We propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2.
arXiv Detail & Related papers (2025-06-19T00:52:30Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction [8.503246256880612]
We propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation.
arXiv Detail & Related papers (2025-02-17T10:53:56Z)
- Towards Fine-grained Interactive Segmentation in Images and Videos [21.22536962888316]
We present SAM2Refiner, a framework built upon the SAM2 backbone. This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos. In addition, a mask refinement module is devised, employing a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder.
arXiv Detail & Related papers (2025-02-12T06:38:18Z)
- Hyper-Transformer for Amodal Completion [82.4118011026855]
Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information.
We introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN)
This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks.
arXiv Detail & Related papers (2024-05-30T11:11:54Z)
- UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z)
- SODAR: Segmenting Objects by Dynamically Aggregating Neighboring Mask Representations [90.8752454643737]
The recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per-grid-cell object masks with fully convolutional networks.
We observe that SOLO generates similar masks for an object at nearby grid cells, and these neighboring predictions can complement each other, as some may better segment certain object parts.
Motivated by this observation, we develop a novel learning-based aggregation method that improves upon SOLO by leveraging the rich neighboring information.
arXiv Detail & Related papers (2022-02-15T13:53:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.