Related papers: Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

URL: http://arxiv.org/abs/2509.15448v1
Date: Thu, 18 Sep 2025 21:44:07 GMT
Title: Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida
Abstract summary: We first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention.
Score: 14.98480544580102
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for the language data, they quickly found their way to the image, video, graph, etc. data modalities with various signal geometries. Despite this versatility, generalizing the attention mechanism to scenarios where data is presented at different scales from potentially different modalities is not straightforward. The attempts to incorporate hierarchy and multi-modality within transformers are largely based on ad hoc heuristics, which are not seamlessly generalizable to similar problems with potentially different structures. To address this problem, in this paper, we take a fundamentally different approach: we first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention while incorporating the inductive biases originating from the hierarchical/geometric information of the problem. We further propose an efficient algorithm based on dynamic programming to compute our derived attention mechanism. By incorporating it within transformers, we show that the proposed hierarchical attention mechanism not only can be employed to train transformer models in hierarchical/multi-modal settings from scratch, but it can also be used to inject hierarchical information into classical, pre-trained transformer models post training, resulting in more efficient models in zero-shot manner.

Related papers

Interpreting Transformer Architectures as Implicit Multinomial Regression [3.2371089062298317]
We show that in a fixed multinomial regression setting, optimizing over latent features yields optimal solutions that align with the dynamics induced by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.
arXiv Detail & Related papers (2025-09-04T20:40:37Z)
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
We provide a fundamental understanding of what distinguishes the Transformer from the other architectures. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies.
arXiv Detail & Related papers (2024-10-14T18:15:02Z)
Dynamical Mean-Field Theory of Self-Attention Neural Networks [0.0]
Transformer-based models have demonstrated exceptional performance across diverse domains. Little is known about how they operate or what are their expected dynamics. We use methods for the study of asymmetric Hopfield networks in nonequilibrium regimes.
arXiv Detail & Related papers (2024-06-11T13:29:34Z)
Multi-Hierarchical Surrogate Learning for Structural Dynamical Crash Simulations Using Graph Convolutional Neural Networks [5.582881461692378]
We propose a multi-hierarchical framework for structurally creating a series of surrogate models for a kart frame. For multiscale phenomena, macroscale features are captured on a coarse surrogate, whereas microscale effects are resolved by finer ones. We train a graph-convolutional neural network-based surrogate that learns parameter-dependent low-dimensional latent dynamics on the coarsest representation.
arXiv Detail & Related papers (2024-02-14T15:22:59Z)
On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting. Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
From system models to class models: An in-context learning paradigm [0.0]
We introduce a novel paradigm for system identification, addressing two primary tasks: one-step-ahead prediction and multi-step simulation. We learn a meta model that represents a class of dynamical systems. For one-step prediction, a GPT-like decoder-only architecture is utilized, whereas the simulation problem employs an encoder-decoder structure.
arXiv Detail & Related papers (2023-08-25T13:50:17Z)
Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average. We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
arXiv Detail & Related papers (2022-09-21T20:52:17Z)
Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head. By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions. Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms. Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
arXiv Detail & Related papers (2022-05-05T05:44:26Z)
Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns. Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.