Expanding Expressivity in Transformer Models with MöbiusAttention
- URL: http://arxiv.org/abs/2409.12175v1
- Date: Sun, 8 Sep 2024 16:56:33 GMT
- Title: Expanding Expressivity in Transformer Models with MöbiusAttention
- Authors: Anna-Maria Halacheva, Mojtaba Nayyeri, Steffen Staab
- Abstract summary: MöbiusAttention integrates Möbius transformations within the attention mechanism of Transformer-based models.
By incorporating the properties of these transformations, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens.
- Score: 17.163751713885013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention mechanisms and Transformer architectures have revolutionized Natural Language Processing (NLP) by enabling exceptional modeling of long-range dependencies and capturing intricate linguistic patterns. However, their inherent reliance on linear operations in the form of matrix multiplications limits their ability to fully capture inter-token relationships on their own. We propose MöbiusAttention, a novel approach that integrates Möbius transformations within the attention mechanism of Transformer-based models. Möbius transformations are non-linear operations in spaces over complex numbers with the ability to map between various geometries. By incorporating these properties, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens and capture a wider range of information through complex-valued weight vectors. We build and pre-train a BERT and a RoFormer version enhanced with MöbiusAttention, which we then fine-tune on the GLUE benchmark. We empirically evaluate our approach against the baseline BERT and RoFormer models on a range of downstream tasks. Our approach compares favorably against the baselines, even with a smaller number of parameters, suggesting the enhanced expressivity of MöbiusAttention. This research paves the way for exploring the potential of Möbius transformations in the complex projective space to enhance the expressivity and performance of foundation models.
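As a rough, hedged illustration of the core idea (not the authors' exact formulation; the layer placement, per-head parameterization, and training details are given in the paper), the NumPy sketch below applies a learned elementwise Möbius map f(z) = (az + b)/(cz + d) to complex-valued query and key projections before computing attention scores. All names, shapes, and parameter values here are illustrative assumptions.

```python
import numpy as np

def mobius(z, a, b, c, d, eps=1e-9):
    """Elementwise Möbius transformation f(z) = (a*z + b) / (c*z + d).
    a, b, c, d are complex scalars with a*d - b*c != 0 so the map is
    invertible; eps guards against division by zero."""
    return (a * z + b) / (c * z + d + eps)

def mobius_attention(q, k, v, a, b, c, d):
    """Toy single-head attention in which complex-valued queries and keys
    are warped by a Möbius map before the score computation.
    q, k: complex arrays of shape (seq_len, dim); v: real (seq_len, dim_v)."""
    q_m = mobius(q, a, b, c, d)
    k_m = mobius(k, a, b, c, d)
    # Hermitian inner product; keep the real part as the similarity score.
    scores = np.real(q_m @ np.conj(k_m).T) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
q = rng.standard_normal((seq_len, dim)) + 1j * rng.standard_normal((seq_len, dim))
k = rng.standard_normal((seq_len, dim)) + 1j * rng.standard_normal((seq_len, dim))
v = rng.standard_normal((seq_len, dim))
out = mobius_attention(q, k, v, a=1.0 + 0.2j, b=0.1j, c=0.05 + 0j, d=1.0 + 0j)
print(out.shape)  # (4, 8)
```

In the actual models such parameters would be learned per head alongside the usual projection weights; the sketch only shows one place where a Möbius map could enter the score computation.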
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- GeoMFormer: A General Architecture for Geometric Molecular Representation Learning [84.02083170392764]
We introduce GeoMFormer, a novel Transformer-based molecular model for geometric molecular representation learning.
We show that GeoMFormer achieves strong performance on both invariant and equivariant tasks of different types and scales.
arXiv Detail & Related papers (2024-06-24T17:58:13Z)
- Optimal Matrix-Mimetic Tensor Algebras via Variable Projection [0.0]
Matrix mimeticity arises from interpreting tensors as operators that can be multiplied, factorized, and analyzed analogously to matrices.
We learn optimal linear mappings and corresponding tensor representations without relying on prior knowledge of the data.
We provide original theory of uniqueness of the transformation and convergence analysis of our variable-projection-based algorithm.
arXiv Detail & Related papers (2024-06-11T04:52:23Z)
- Shape Arithmetic Expressions: Advancing Scientific Discovery Beyond Closed-Form Equations [56.78271181959529]
Generalized Additive Models (GAMs) can capture non-linear relationships between variables and targets, but they cannot capture intricate feature interactions.
We propose Shape Arithmetic Expressions (SHAREs), which fuse GAMs' flexible shape functions with the complex feature interactions found in mathematical expressions.
We also design a set of rules for constructing SHAREs that guarantee transparency of the found expressions beyond the standard constraints.
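For context, a standard GAM keeps the predictor additive in univariate shape functions, which is exactly why it cannot express feature interactions; a SHARE-style model, as summarized above, lets shape functions participate in arithmetic expressions. The second form below is an illustrative assumption only, not a construction taken from the paper:

```latex
% Generalized Additive Model: additive, no feature interactions
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)

% SHARE-style expression (illustrative assumption): shape functions
% composed inside an arithmetic expression, enabling interactions
y \approx f_1(x_1)\cdot f_2(x_2) + f_3(x_1 + x_2)
```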
arXiv Detail & Related papers (2024-04-15T13:44:01Z)
- Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces.
We also provide a kernelized approach to non-Euclidean attention, which enables our model to run with time and memory cost linear in the number of nodes and edges.
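The paper's kernelized, linear-cost construction over products of constant-curvature spaces is not reproduced here; purely as a toy illustration of non-Euclidean attention, the sketch below scores queries against keys by negative geodesic distance in a single Poincaré ball (quadratic cost, and all names and shapes are assumptions):

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Pairwise geodesic distance in the Poincaré ball (curvature -1):
    d(x, y) = arccosh(1 + 2*|x-y|^2 / ((1-|x|^2)(1-|y|^2)))."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    nx = 1.0 - np.sum(x ** 2, axis=-1)
    ny = 1.0 - np.sum(y ** 2, axis=-1)
    arg = 1.0 + 2.0 * sq / (nx[:, None] * ny[None, :] + eps)
    return np.arccosh(np.maximum(arg, 1.0))

def hyperbolic_attention(q, k, v, tau=1.0):
    """Toy attention whose scores are negative hyperbolic distances between
    queries and keys that live inside the open unit ball."""
    scores = -poincare_distance(q, k) / tau
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def to_ball(x, radius=0.9):
    """Map arbitrary vectors into the open unit ball so distances are defined."""
    return radius * x / (1.0 + np.linalg.norm(x, axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n, d = 5, 4
q, k = to_ball(rng.standard_normal((n, d))), to_ball(rng.standard_normal((n, d)))
v = rng.standard_normal((n, d))
print(hyperbolic_attention(q, k, v).shape)  # (5, 4)
```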
arXiv Detail & Related papers (2023-09-08T02:44:37Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator means that the latent space yields an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- Möbius Convolutions for Spherical CNNs [26.91151736538527]
M"obius transformations play an important role in both geometry and spherical image processing.
We present a novel, M"obius-equivariant spherical convolution operator.
We demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.
arXiv Detail & Related papers (2022-01-28T16:11:47Z)
- Disentangled Representation Learning and Generation with Manifold Optimization [10.69910379275607]
This work presents a representation learning framework that explicitly promotes disentanglement by encouraging directions of variation.
Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.
arXiv Detail & Related papers (2020-06-12T10:00:49Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
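As a minimal sketch of this idea under simplifying assumptions (i.i.d. Gaussian random features instead of the orthogonal features that give FAVOR its name, a single head, no masking), the positive feature map phi(x) = exp(w·x − ||x||²/2)/√m gives an unbiased estimate of the softmax kernel exp(q·k), so attention can be computed without ever materializing the n × n score matrix:

```python
import numpy as np

def positive_random_features(x, w):
    """Positive feature map phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so that
    E[phi(q) . phi(k)] = exp(q . k). Orthogonalizing the rows of W (as in
    FAVOR) reduces the estimator's variance; it is omitted in this sketch."""
    m = w.shape[0]
    proj = x @ w.T                                            # (n, m)
    return np.exp(proj - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, num_features=64, seed=0):
    """Attention with time and memory linear in sequence length: only
    phi(K)^T V and phi(K)^T 1 are formed, never the n x n score matrix."""
    d = q.shape[-1]
    w = np.random.default_rng(seed).standard_normal((num_features, d))
    q_f = positive_random_features(q / d ** 0.25, w)          # 1/d^(1/4) scaling so that
    k_f = positive_random_features(k / d ** 0.25, w)          # phi(q).phi(k) ~ exp(q.k/sqrt(d))
    kv = k_f.T @ v                                            # (m, d_v)
    normalizer = q_f @ k_f.sum(axis=0)                        # (n,)
    return (q_f @ kv) / normalizer[:, None]

rng = np.random.default_rng(1)
n, d = 6, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (6, 16)
```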
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
- Inverse Learning of Symmetries [71.62109774068064]
We learn the symmetry transformation with a model consisting of two latent subspaces.
Our approach is based on the deep information bottleneck in combination with a continuous mutual information regulariser.
Our model outperforms state-of-the-art methods on artificial and molecular datasets.
arXiv Detail & Related papers (2020-02-07T13:48:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.