Revisiting Over-smoothing in BERT from the Perspective of Graph
- URL: http://arxiv.org/abs/2202.08625v1
- Date: Thu, 17 Feb 2022 12:20:52 GMT
- Title: Revisiting Over-smoothing in BERT from the Perspective of Graph
- Authors: Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng
Kong, Stephen M.S. Lee, James T. Kwok
- Abstract summary: Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which adaptively combine the representations from different layers to make the output more diverse.
- Score: 111.24636158179908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the over-smoothing phenomenon of Transformer-based models has
been observed in both the vision and language fields. However, no existing work
has investigated the main cause of this phenomenon in depth. In this work, we
attempt to analyze the over-smoothing problem from the perspective of graphs,
where the problem was first discovered and explored. Intuitively, the
self-attention matrix can be seen as the normalized adjacency matrix of a
corresponding graph. Based on this connection, we provide some theoretical
analysis and find that layer normalization plays a key role in the
over-smoothing issue of Transformer-based models. Specifically, if the standard
deviation of layer normalization is sufficiently large, the output of the
Transformer stack will converge to a specific low-rank subspace and result in
over-smoothing. To alleviate the over-smoothing problem, we consider
hierarchical fusion strategies, which adaptively combine the representations
from different layers to make the output more diverse. Extensive experimental
results on various data sets illustrate the effectiveness of our fusion method.
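Below is a minimal numerical sketch (not taken from the paper; all sizes, weights, and fusion coefficients are illustrative assumptions) of the two ideas above: the softmax self-attention matrix is row-stochastic, so it behaves like a normalized adjacency matrix of a graph; stacking such attention-only updates drives the token representations toward a low-rank subspace (over-smoothing); and a simple weighted fusion of all layer outputs keeps the result more diverse than the last layer alone.

```python
# Illustrative sketch only: (1) softmax self-attention as a row-stochastic
# ("normalized adjacency") matrix, (2) repeated application driving token
# representations toward a rank-1 subspace (over-smoothing), and (3) a toy
# hierarchical fusion that mixes all layer outputs. Names and sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 16, 32, 12

def self_attention_matrix(X, Wq, Wk):
    """Softmax attention scores; each row sums to 1, like D^{-1}A for a graph."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)             # row-stochastic

def token_similarity(X):
    """Mean pairwise cosine similarity; approaches 1 as tokens over-smooth."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    return (S.sum() - n_tokens) / (n_tokens * (n_tokens - 1))

X = rng.standard_normal((n_tokens, d_model))
layer_outputs = [X]
for _ in range(n_layers):
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    A = self_attention_matrix(layer_outputs[-1], Wq, Wk)
    layer_outputs.append(A @ layer_outputs[-1])          # value/FFN parts omitted

print("similarity at layer 0 :", round(token_similarity(layer_outputs[0]), 3))
print("similarity at layer 12:", round(token_similarity(layer_outputs[-1]), 3))

# Toy "hierarchical fusion": a softmax-weighted combination of every layer's
# output (the weights would be learned in practice, not drawn at random).
w = np.exp(rng.standard_normal(n_layers + 1))
w /= w.sum()
fused = sum(wi * Xi for wi, Xi in zip(w, layer_outputs))
print("similarity of fused   :", round(token_similarity(fused), 3))
```

In a typical run, the mean pairwise cosine similarity of the tokens approaches 1 in the deeper layers, while the fused output stays noticeably lower because it retains contributions from the more diverse early layers.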
Related papers
- Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers [3.686808512438363]
This paper examines signal propagation in attention-only transformers from a random matrix perspective.
We show that a spectral gap between the two largest singular values of the attention matrix causes rank collapse in width; a toy numerical illustration of this relationship appears after this list.
We propose a novel, yet simple, practical solution that resolves rank collapse in width by removing the spectral gap.
arXiv Detail & Related papers (2024-10-10T10:34:18Z) - FANFOLD: Graph Normalizing Flows-driven Asymmetric Network for Unsupervised Graph-Level Anomaly Detection [18.758250338590297]
Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application.
We propose a Graph Normalizing Flows-driven Asymmetric Network For Unsupervised Graph-Level Anomaly Detection (FANFOLD)
arXiv Detail & Related papers (2024-06-29T09:49:16Z) - Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs [30.003409099607204]
We provide a formal and precise characterization of (linearized) graph neural networks (GNNs) with residual connections and normalization layers.
We show that the centering step of a normalization layer alters the graph signal in message-passing in such a way that relevant information can become harder to extract.
We introduce a novel, principled normalization layer called GraphNormv2 in which the centering step is learned such that it does not distort the original graph signal in an undesirable way.
arXiv Detail & Related papers (2024-06-05T06:53:16Z) - What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z) - AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model [59.08735812631131]
Anomaly inspection plays an important role in industrial manufacture.
Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data.
We propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model.
arXiv Detail & Related papers (2023-12-10T05:13:40Z) - Advective Diffusion Transformers for Topological Generalization in Graph
Learning [69.2894350228753]
We show how graph diffusion equations extrapolate and generalize in the presence of varying graph topologies.
We propose a novel graph encoder backbone, Advective Diffusion Transformer (ADiT), inspired by advective graph diffusion equations.
arXiv Detail & Related papers (2023-10-10T08:40:47Z) - DAGAD: Data Augmentation for Graph Anomaly Detection [57.92471847260541]
This paper devises a novel Data Augmentation-based Graph Anomaly Detection (DAGAD) framework for attributed graphs.
A series of experiments on three datasets shows that DAGAD outperforms ten state-of-the-art baseline detectors on various widely used metrics.
arXiv Detail & Related papers (2022-10-18T11:28:21Z) - Multilayer Clustered Graph Learning [66.94201299553336]
We use contrastive loss as a data fidelity term, in order to properly aggregate the observed layers into a representative graph.
Experiments show that our method yields clusters that are consistent with the ground truth.
We also learn a clustering of the nodes as part of solving the graph learning problem.
arXiv Detail & Related papers (2020-10-29T09:58:02Z)
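As referenced in the Mind the Gap entry above, the following is a small, self-contained illustration (my own toy code under assumed dimensions, not the authors' method) of how a gap between the two largest singular values of a row-stochastic attention matrix accompanies the loss of rank in token representations under repeated attention-only updates.

```python
# Toy illustration (not the paper's proposed solution): track the gap between
# the two largest singular values of each layer's softmax attention matrix
# while the token representations lose rank under attention-only updates.
import numpy as np

rng = np.random.default_rng(1)
n, d, depth = 16, 32, 10
X = rng.standard_normal((n, d))

def effective_rank(M):
    """Entropy-based effective rank of a matrix's singular-value spectrum."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

for layer in range(depth):
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                    # row-stochastic attention
    s = np.linalg.svd(A, compute_uv=False)
    print(f"layer {layer:2d}: spectral gap = {s[0] - s[1]:.3f}, "
          f"effective rank of X = {effective_rank(X):.2f}")
    X = A @ X                                            # attention-only update
```

In a typical run, the gap remains clearly positive (and tends to widen) across layers while the effective rank of X drops sharply.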
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.