Revisiting Over-smoothing in BERT from the Perspective of Graph
- URL: http://arxiv.org/abs/2202.08625v1
- Date: Thu, 17 Feb 2022 12:20:52 GMT
- Title: Revisiting Over-smoothing in BERT from the Perspective of Graph
- Authors: Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng
Kong, Stephen M.S. Lee, James T. Kwok
- Abstract summary: Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which adaptively combine the representations from different layers to make the output more diverse.
- Score: 111.24636158179908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the over-smoothing phenomenon of Transformer-based models has
been observed in both the vision and language fields. However, no existing work
has investigated the main cause of this phenomenon in depth. In this work, we
attempt to analyze the over-smoothing problem from the perspective of graphs,
where the problem was first discovered and explored. Intuitively, the
self-attention matrix can be seen as the normalized adjacency matrix of a
corresponding graph. Based on this connection, we provide some theoretical
analysis and find that layer normalization plays a key role in the
over-smoothing issue of Transformer-based models. Specifically, if the standard
deviation of layer normalization is sufficiently large, the output of the
Transformer stack will converge to a specific low-rank subspace and result in
over-smoothing. To alleviate the over-smoothing problem, we consider
hierarchical fusion strategies, which adaptively combine the representations
from different layers to make the output more diverse. Extensive experimental
results on various data sets illustrate the effectiveness of our fusion method.
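Below is a minimal numerical sketch (not taken from the paper; all sizes, weights, and fusion coefficients are illustrative assumptions) of the two ideas above: the softmax self-attention matrix is row-stochastic, so it behaves like a normalized adjacency matrix of a graph; stacking such attention-only updates drives the token representations toward a low-rank subspace (over-smoothing); and a simple weighted fusion of all layer outputs keeps the result more diverse than the last layer alone.

```python
# Illustrative sketch only: (1) softmax self-attention as a row-stochastic
# ("normalized adjacency") matrix, (2) repeated application driving token
# representations toward a rank-1 subspace (over-smoothing), and (3) a toy
# hierarchical fusion that mixes all layer outputs. Names and sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 16, 32, 12

def self_attention_matrix(X, Wq, Wk):
    """Softmax attention scores; each row sums to 1, like D^{-1}A for a graph."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)             # row-stochastic

def token_similarity(X):
    """Mean pairwise cosine similarity; approaches 1 as tokens over-smooth."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    return (S.sum() - n_tokens) / (n_tokens * (n_tokens - 1))

X = rng.standard_normal((n_tokens, d_model))
layer_outputs = [X]
for _ in range(n_layers):
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    A = self_attention_matrix(layer_outputs[-1], Wq, Wk)
    layer_outputs.append(A @ layer_outputs[-1])          # value/FFN parts omitted

print("similarity at layer 0 :", round(token_similarity(layer_outputs[0]), 3))
print("similarity at layer 12:", round(token_similarity(layer_outputs[-1]), 3))

# Toy "hierarchical fusion": a softmax-weighted combination of every layer's
# output (the weights would be learned in practice, not drawn at random).
w = np.exp(rng.standard_normal(n_layers + 1))
w /= w.sum()
fused = sum(wi * Xi for wi, Xi in zip(w, layer_outputs))
print("similarity of fused   :", round(token_similarity(fused), 3))
```

In a typical run, the mean pairwise cosine similarity of the tokens approaches 1 in the deeper layers, while the fused output stays noticeably lower because it retains contributions from the more diverse early layers.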
Related papers
- Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers [3.686808512438363]
This paper examines signal propagation in attention-only transformers from a random matrix perspective.
We show that a spectral gap between the two largest singular values of the attention matrix causes rank collapse in width; a toy numerical illustration of this relationship appears after this list.
We propose a novel, yet simple, practical solution that resolves rank collapse in width by removing the spectral gap.
arXiv Detail & Related papers (2024-10-10T10:34:18Z) - FANFOLD: Graph Normalizing Flows-driven Asymmetric Network for Unsupervised Graph-Level Anomaly Detection [18.758250338590297]
Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application.
We propose a Graph Normalizing Flows-driven Asymmetric Network For Unsupervised Graph-Level Anomaly Detection (FANFOLD)
arXiv Detail & Related papers (2024-06-29T09:49:16Z) - Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs [30.003409099607204]
We provide a formal and precise characterization of (linearized) graph neural networks (GNNs) with residual connections and normalization layers.
We show that the centering step of a normalization layer alters the graph signal in message-passing in such a way that relevant information can become harder to extract.
We introduce a novel, principled normalization layer called GraphNormv2 in which the centering step is learned such that it does not distort the original graph signal in an undesirable way.
arXiv Detail & Related papers (2024-06-05T06:53:16Z) - What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z) - AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model [59.08735812631131]
Anomaly inspection plays an important role in industrial manufacture.
Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data.
We propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model.
arXiv Detail & Related papers (2023-12-10T05:13:40Z) - Advective Diffusion Transformers for Topological Generalization in Graph
Learning [69.2894350228753]
We show how graph diffusion equations extrapolate and generalize in the presence of varying graph topologies.
We propose a novel graph encoder backbone, Advective Diffusion Transformer (ADiT), inspired by advective graph diffusion equations.
arXiv Detail & Related papers (2023-10-10T08:40:47Z) - DAGAD: Data Augmentation for Graph Anomaly Detection [57.92471847260541]
This paper devises a novel Data Augmentation-based Graph Anomaly Detection (DAGAD) framework for attributed graphs.
A series of experiments on three datasets shows that DAGAD outperforms ten state-of-the-art baseline detectors on various widely used metrics.
arXiv Detail & Related papers (2022-10-18T11:28:21Z) - Multilayer Clustered Graph Learning [66.94201299553336]
We use contrastive loss as a data fidelity term, in order to properly aggregate the observed layers into a representative graph.
Experiments show that our method yields clusters that are consistent with the ground truth.
We also learn a clustering of the nodes as part of solving the graph learning problem.
arXiv Detail & Related papers (2020-10-29T09:58:02Z)
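As referenced in the Mind the Gap entry above, the following is a small, self-contained illustration (my own toy code under assumed dimensions, not the authors' method) of how a gap between the two largest singular values of a row-stochastic attention matrix accompanies the loss of rank in token representations under repeated attention-only updates.

```python
# Toy illustration (not the paper's proposed solution): track the gap between
# the two largest singular values of each layer's softmax attention matrix
# while the token representations lose rank under attention-only updates.
import numpy as np

rng = np.random.default_rng(1)
n, d, depth = 16, 32, 10
X = rng.standard_normal((n, d))

def effective_rank(M):
    """Entropy-based effective rank of a matrix's singular-value spectrum."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

for layer in range(depth):
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                    # row-stochastic attention
    s = np.linalg.svd(A, compute_uv=False)
    print(f"layer {layer:2d}: spectral gap = {s[0] - s[1]:.3f}, "
          f"effective rank of X = {effective_rank(X):.2f}")
    X = A @ X                                            # attention-only update
```

In a typical run, the gap remains clearly positive (and tends to widen) across layers while the effective rank of X drops sharply.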
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.