THG: Transformer with Hyperbolic Geometry
- URL: http://arxiv.org/abs/2106.07350v1
- Date: Tue, 1 Jun 2021 14:09:33 GMT
- Title: THG: Transformer with Hyperbolic Geometry
- Authors: Zhe Liu and Yibin Xu
- Abstract summary: Most "X-former" models only address the quadratic time and memory complexity of self-attention.
We propose a novel Transformer with Hyperbolic Geometry (THG) model, which takes advantage of both Euclidean space and hyperbolic space.
- Score: 8.895324519034057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer model architectures have become an indispensable staple in deep
learning lately for their effectiveness across a range of tasks. Recently, a
surge of "X-former" models have been proposed which improve upon the original
Transformer architecture. However, most of these variants make changes only
around the quadratic time and memory complexity of self-attention, i.e. the dot
product between the query and the key. Moreover, they are computed solely
in Euclidean space. In this work, we propose a novel Transformer with
Hyperbolic Geometry (THG) model, which takes advantage of both Euclidean
space and hyperbolic space. THG improves the linear transformations of
self-attention, which are applied to the input sequence to obtain the query and
the key, using the proposed hyperbolic linear transformation. Extensive
experiments on sequence labeling, machine reading comprehension, and
classification tasks demonstrate the effectiveness and generalizability of our
model. The results also show that THG can alleviate overfitting.
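The abstract describes a "hyperbolic linear" replacing the query and key projections but does not spell out its formulation, so the following is a minimal sketch assuming the common Poincaré-ball construction (a Möbius matrix-vector product applied via the exponential and logarithmic maps at the origin). The class name HyperbolicLinear, the curvature parameter c, and the toy dimensions are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of a "hyperbolic linear" layer, assuming the common
# Poincare-ball formulation: a Euclidean linear map applied in the
# tangent space at the origin (Mobius matrix-vector product).
import torch
import torch.nn as nn


class HyperbolicLinear(nn.Module):
    """Linear map applied in the tangent space of the Poincare ball."""

    def __init__(self, in_dim: int, out_dim: int, c: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim))
        nn.init.xavier_uniform_(self.weight)
        self.c = c  # ball curvature magnitude (curvature = -c); assumed hyperparameter

    def _log0(self, x: torch.Tensor) -> torch.Tensor:
        # Map a point on the ball to the tangent space at the origin.
        sqrt_c = self.c ** 0.5
        norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        return torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5)) * x / (sqrt_c * norm)

    def _exp0(self, v: torch.Tensor) -> torch.Tensor:
        # Map a tangent vector at the origin back onto the ball.
        sqrt_c = self.c ** 0.5
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mobius matrix-vector product: exp_0(W @ log_0(x)).
        return self._exp0(self._log0(x) @ self.weight.t())


# Toy usage: swap the hyperbolic projection in for the query and key only,
# as the abstract suggests; values and the dot-product scores stay Euclidean.
d_model = 64
x = torch.randn(2, 10, d_model) * 0.05     # (batch, seq, dim), scaled to lie inside the unit ball
q = HyperbolicLinear(d_model, d_model)(x)  # hyperbolic query projection
k = HyperbolicLinear(d_model, d_model)(x)  # hyperbolic key projection
scores = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
```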
Related papers
- Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space [47.4014545166959]
We introduce Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry.
We develop a linear self-attention mechanism in hyperbolic space, enabling hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time.
arXiv Detail & Related papers (2024-07-01T13:44:38Z) - EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
arXiv Detail & Related papers (2024-03-26T14:18:43Z) - Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z) - Hiformer: Heterogeneous Feature Interactions Learning with Transformers
for Recommender Systems [27.781785405875084]
We propose to leverage a Transformer-based architecture with attention layers to automatically capture feature interactions.
We identify two key challenges for applying the vanilla Transformer architecture to web-scale recommender systems.
arXiv Detail & Related papers (2023-11-10T05:57:57Z) - Sliceformer: Make Multi-head Attention as Simple as Sorting in
Discriminative Tasks [32.33355192614434]
We propose an effective and efficient surrogate of the Transformer, called Sliceformer.
Our Sliceformer replaces the classic multi-head attention (MHA) mechanism with an extremely simple "slicing-sorting" operation.
Our Sliceformer achieves comparable or better performance with lower memory cost and faster speed than the Transformer and its variants.
arXiv Detail & Related papers (2023-10-26T14:43:07Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit weights from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z) - Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a sketch of the underlying Toeplitz-times-vector trick is given after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
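As a sketch of the Toeplitz/FFT observation cited in the kernelized-attention entry above: a Toeplitz matrix (such as one built from relative-positional biases) can be multiplied with a vector in O(n log n) by embedding it in a circulant matrix and using the FFT. The function name and toy check below are illustrative assumptions and do not reproduce that paper's full kernelized-attention algorithm.

```python
# Hedged sketch: fast Toeplitz matrix-vector product via circulant
# embedding and the FFT, the generic identity behind the RPE trick.
import numpy as np


def toeplitz_matvec_fft(first_col: np.ndarray, first_row: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute T @ x, where T is Toeplitz with the given first column and row."""
    n = len(x)
    # Embed T in a (2n)x(2n) circulant matrix whose first column is
    # [first_col, 0, reversed tail of first_row].
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
    # A circulant matvec is a circular convolution, computed with the FFT.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real


# Check against the dense O(n^2) product for a random Toeplitz matrix.
n = 8
first_col = np.random.randn(n)                                         # entries T[i, 0]
first_row = np.concatenate([[first_col[0]], np.random.randn(n - 1)])   # entries T[0, j]
T = np.array([[first_col[i - j] if i >= j else first_row[j - i] for j in range(n)]
              for i in range(n)])
x = np.random.randn(n)
assert np.allclose(T @ x, toeplitz_matvec_fft(first_col, first_row, x))
```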