Expanding Expressivity in Transformer Models with MöbiusAttention
- URL: http://arxiv.org/abs/2409.12175v1
- Date: Sun, 8 Sep 2024 16:56:33 GMT
- Title: Expanding Expressivity in Transformer Models with MöbiusAttention
- Authors: Anna-Maria Halacheva, Mojtaba Nayyeri, Steffen Staab
- Abstract summary: MöbiusAttention integrates Möbius transformations within the attention mechanism of Transformer-based models.
By incorporating the properties of these transformations, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens.
- Score: 17.163751713885013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention mechanisms and Transformer architectures have revolutionized Natural Language Processing (NLP) by enabling exceptional modeling of long-range dependencies and capturing intricate linguistic patterns. However, their inherent reliance on linear operations in the form of matrix multiplications limits their ability to fully capture inter-token relationships on their own. We propose MöbiusAttention, a novel approach that integrates Möbius transformations within the attention mechanism of Transformer-based models. Möbius transformations are non-linear operations in spaces over complex numbers with the ability to map between various geometries. By incorporating these properties, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens and capture a wider range of information through complex-valued weight vectors. We build and pre-train a BERT and a RoFormer version enhanced with MöbiusAttention, which we then fine-tune on the GLUE benchmark. We empirically evaluate our approach against the baseline BERT and RoFormer models on a range of downstream tasks. Our approach compares favorably against the baselines, even with a smaller number of parameters, suggesting the enhanced expressivity of MöbiusAttention. This research paves the way for exploring the potential of Möbius transformations in the complex projective space to enhance the expressivity and performance of foundation models.
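As a rough, hedged illustration of the core idea (not the authors' exact formulation; the layer placement, per-head parameterization, and training details are given in the paper), the NumPy sketch below applies a learned elementwise Möbius map f(z) = (az + b)/(cz + d) to complex-valued query and key projections before computing attention scores. All names, shapes, and parameter values here are illustrative assumptions.

```python
import numpy as np

def mobius(z, a, b, c, d, eps=1e-9):
    """Elementwise Möbius transformation f(z) = (a*z + b) / (c*z + d).
    a, b, c, d are complex scalars with a*d - b*c != 0 so the map is
    invertible; eps guards against division by zero."""
    return (a * z + b) / (c * z + d + eps)

def mobius_attention(q, k, v, a, b, c, d):
    """Toy single-head attention in which complex-valued queries and keys
    are warped by a Möbius map before the score computation.
    q, k: complex arrays of shape (seq_len, dim); v: real (seq_len, dim_v)."""
    q_m = mobius(q, a, b, c, d)
    k_m = mobius(k, a, b, c, d)
    # Hermitian inner product; keep the real part as the similarity score.
    scores = np.real(q_m @ np.conj(k_m).T) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
q = rng.standard_normal((seq_len, dim)) + 1j * rng.standard_normal((seq_len, dim))
k = rng.standard_normal((seq_len, dim)) + 1j * rng.standard_normal((seq_len, dim))
v = rng.standard_normal((seq_len, dim))
out = mobius_attention(q, k, v, a=1.0 + 0.2j, b=0.1j, c=0.05 + 0j, d=1.0 + 0j)
print(out.shape)  # (4, 8)
```

In the actual models such parameters would be learned per head alongside the usual projection weights; the sketch only shows one place where a Möbius map could enter the score computation.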
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- GeoMFormer: A General Architecture for Geometric Molecular Representation Learning [84.02083170392764]
We introduce GeoMFormer, a novel Transformer-based molecular model for geometric molecular representation learning.
We show that GeoMFormer achieves strong performance on both invariant and equivariant tasks of different types and scales.
arXiv Detail & Related papers (2024-06-24T17:58:13Z)
- Optimal Matrix-Mimetic Tensor Algebras via Variable Projection [0.0]
Matrix mimeticity arises from interpreting tensors as operators that can be multiplied, factorized, and analyzed analogously to matrices.
We learn optimal linear mappings and corresponding tensor representations without relying on prior knowledge of the data.
We provide original theory of uniqueness of the transformation and convergence analysis of our variable-projection-based algorithm.
arXiv Detail & Related papers (2024-06-11T04:52:23Z)
- Shape Arithmetic Expressions: Advancing Scientific Discovery Beyond Closed-Form Equations [56.78271181959529]
Generalized Additive Models (GAMs) can capture non-linear relationships between variables and targets, but they cannot capture intricate feature interactions.
We propose Shape Arithmetic Expressions (SHAREs), which fuse GAMs' flexible shape functions with the complex feature interactions found in mathematical expressions.
We also design a set of rules for constructing SHAREs that guarantee transparency of the found expressions beyond the standard constraints.
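For context, a standard GAM keeps the predictor additive in univariate shape functions, which is exactly why it cannot express feature interactions; a SHARE-style model, as summarized above, lets shape functions participate in arithmetic expressions. The second form below is an illustrative assumption only, not a construction taken from the paper:

```latex
% Generalized Additive Model: additive, no feature interactions
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)

% SHARE-style expression (illustrative assumption): shape functions
% composed inside an arithmetic expression, enabling interactions
y \approx f_1(x_1)\cdot f_2(x_2) + f_3(x_1 + x_2)
```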
arXiv Detail & Related papers (2024-04-15T13:44:01Z)
- Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces.
We also provide a kernelized approach to non-Euclidean attention, which enables our model to run with time and memory cost linear in the number of nodes and edges.
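The paper's kernelized, linear-cost construction over products of constant-curvature spaces is not reproduced here; purely as a toy illustration of non-Euclidean attention, the sketch below scores queries against keys by negative geodesic distance in a single Poincaré ball (quadratic cost, and all names and shapes are assumptions):

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Pairwise geodesic distance in the Poincaré ball (curvature -1):
    d(x, y) = arccosh(1 + 2*|x-y|^2 / ((1-|x|^2)(1-|y|^2)))."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    nx = 1.0 - np.sum(x ** 2, axis=-1)
    ny = 1.0 - np.sum(y ** 2, axis=-1)
    arg = 1.0 + 2.0 * sq / (nx[:, None] * ny[None, :] + eps)
    return np.arccosh(np.maximum(arg, 1.0))

def hyperbolic_attention(q, k, v, tau=1.0):
    """Toy attention whose scores are negative hyperbolic distances between
    queries and keys that live inside the open unit ball."""
    scores = -poincare_distance(q, k) / tau
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def to_ball(x, radius=0.9):
    """Map arbitrary vectors into the open unit ball so distances are defined."""
    return radius * x / (1.0 + np.linalg.norm(x, axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n, d = 5, 4
q, k = to_ball(rng.standard_normal((n, d))), to_ball(rng.standard_normal((n, d)))
v = rng.standard_normal((n, d))
print(hyperbolic_attention(q, k, v).shape)  # (5, 4)
```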
arXiv Detail & Related papers (2023-09-08T02:44:37Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator means that the latent space yields an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- Möbius Convolutions for Spherical CNNs [26.91151736538527]
M"obius transformations play an important role in both geometry and spherical image processing.
We present a novel, M"obius-equivariant spherical convolution operator.
We demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.
arXiv Detail & Related papers (2022-01-28T16:11:47Z)
- Disentangled Representation Learning and Generation with Manifold Optimization [10.69910379275607]
This work presents a representation learning framework that explicitly promotes disentanglement by encouraging directions of variation.
Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.
arXiv Detail & Related papers (2020-06-12T10:00:49Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
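As a minimal sketch of this idea under simplifying assumptions (i.i.d. Gaussian random features instead of the orthogonal features that give FAVOR its name, a single head, no masking), the positive feature map phi(x) = exp(w·x − ||x||²/2)/√m gives an unbiased estimate of the softmax kernel exp(q·k), so attention can be computed without ever materializing the n × n score matrix:

```python
import numpy as np

def positive_random_features(x, w):
    """Positive feature map phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k). Orthogonalizing the rows of W (as in
    FAVOR) reduces the estimator's variance; it is omitted in this sketch."""
    m = w.shape[0]
    proj = x @ w.T                                            # (n, m)
    return np.exp(proj - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, num_features=64, seed=0):
    """Attention with time and memory linear in sequence length: only
    phi(K)^T V and phi(K)^T 1 are formed, never the n x n score matrix."""
    d = q.shape[-1]
    w = np.random.default_rng(seed).standard_normal((num_features, d))
    q_f = positive_random_features(q / d ** 0.25, w)          # 1/d^(1/4) scaling so that
    k_f = positive_random_features(k / d ** 0.25, w)          # phi(q).phi(k) ~ exp(q.k/sqrt(d))
    kv = k_f.T @ v                                            # (m, d_v)
    normalizer = q_f @ k_f.sum(axis=0)                        # (n,)
    return (q_f @ kv) / normalizer[:, None]

rng = np.random.default_rng(1)
n, d = 6, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (6, 16)
```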
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
- Inverse Learning of Symmetries [71.62109774068064]
We learn the symmetry transformation with a model consisting of two latent subspaces.
Our approach is based on the deep information bottleneck in combination with a continuous mutual information regulariser.
Our model outperforms state-of-the-art methods on artificial and molecular datasets.
arXiv Detail & Related papers (2020-02-07T13:48:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.