A unified framework on the universal approximation of transformer-type architectures
- URL: http://arxiv.org/abs/2506.23551v1
- Date: Mon, 30 Jun 2025 06:50:39 GMT
- Title: A unified framework on the universal approximation of transformer-type architectures
- Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen
- Abstract summary: We investigate the universal approximation property (UAP) of transformer-type architectures. Our work identifies token distinguishability as a fundamental requirement for UAP. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms.
- Score: 16.762119652883204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach to establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.
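To make the two central notions concrete, here is a schematic rendering in our own notation (the symbols and exact formulation below are illustrative, not taken from the paper): the universal approximation property asks that the transformer-type hypothesis class can approximate any continuous sequence-to-sequence map on a compact set, and token distinguishability asks that the architecture can map distinct input tokens to distinct outputs.

```latex
% UAP (schematic): \mathcal{H} is the transformer-type hypothesis class acting on
% sequences of N tokens drawn from a compact set K \subset \mathbb{R}^{N \times d}.
\forall f \in C\!\left(K, \mathbb{R}^{N \times d'}\right),\ \forall \varepsilon > 0,\
\exists h \in \mathcal{H}:\quad
\sup_{X \in K} \big\| h(X) - f(X) \big\| < \varepsilon .

% Token distinguishability (schematic): whenever two tokens of an input differ,
% some map g realizable by the architecture's layers separates them.
x_i \neq x_j \;\Longrightarrow\; \exists g \in \mathcal{G}:\ g(X)_i \neq g(X)_j .
```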
Related papers
- Cross-Model Semantics in Representation Learning [1.2064681974642195]
We show that structural regularities induce representational geometry that is more stable under architectural variation. This suggests that certain forms of inductive bias not only support generalization within a model, but also improve the interoperability of learned features across models.
arXiv Detail & Related papers (2025-08-05T16:57:24Z)
- Hierarchical Modeling and Architecture Optimization: Review and Unified Framework [0.6291443816903801]
This paper reviews literature on structured input spaces and proposes a unified framework that generalizes existing approaches. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures.
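As a hypothetical illustration of that meta/decreed distinction (the variable names and structure below are invented for this sketch, not drawn from the paper), a conditional design space can be encoded so that the value of a meta variable determines which other variables exist in a configuration:

```python
# Hypothetical sketch of a hierarchical design space: the meta variable
# "solver" decrees which hyperparameters are present in a configuration.
DECREED_BY_SOLVER = {
    "gradient": ["learning_rate", "momentum"],          # active only if solver == "gradient"
    "evolutionary": ["population_size", "mutation_rate"],
}

def active_variables(config: dict) -> list[str]:
    """Return the variables that are actually in play for this configuration."""
    meta_value = config["solver"]                        # meta variable governs the structure
    return ["solver"] + DECREED_BY_SOLVER[meta_value]

def validate(config: dict) -> None:
    """Reject configurations that set variables their meta value does not decree."""
    allowed = set(active_variables(config))
    extras = set(config) - allowed
    if extras:
        raise ValueError(f"variables {sorted(extras)} are not decreed by solver={config['solver']!r}")

validate({"solver": "gradient", "learning_rate": 0.1, "momentum": 0.9})   # ok
# validate({"solver": "gradient", "population_size": 50})  # would raise: not decreed
```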
arXiv Detail & Related papers (2025-06-27T20:38:57Z)
- How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation [37.57021686999279]
This work focuses on the impact of sequence modeling architectures on base capabilities. We first point out that the mixed domain pre-training setting fails to adequately reveal the differences in base capabilities among various architectures. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer.
arXiv Detail & Related papers (2025-05-24T05:40:03Z)
- Directional Non-Commutative Monoidal Structures for Compositional Embeddings in Machine Learning [0.0]
We introduce a new structure for compositional embeddings built on directional non-commutative monoidal operators. Our construction defines a distinct composition operator ∘_i for each axis i, ensuring associative combination along each axis without imposing global commutativity. All axis-specific operators commute with one another, enforcing a global interchange law that enables consistent cross-axis compositions.
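In standard notation (ours, not necessarily the paper's), the properties listed above amount to per-axis associativity plus a cross-axis interchange law:

```latex
% Associativity within each axis i (no commutativity assumed within an axis):
(a \circ_i b) \circ_i c = a \circ_i (b \circ_i c).

% Interchange (cross-axis compatibility) for distinct axes i \neq j:
(a \circ_i b) \circ_j (c \circ_i d) = (a \circ_j c) \circ_i (b \circ_j d).
```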
arXiv Detail & Related papers (2025-05-21T13:27:14Z)
- A Survey of Model Architectures in Information Retrieval [64.75808744228067]
We focus on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs). We conclude by discussing emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains beyond traditional search paradigms.
arXiv Detail & Related papers (2025-02-20T18:42:58Z)
- Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation [54.50526986788175]
Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs.
We present a unified view of these models, formulating such layers as implicit causal self-attention layers.
Our framework compares the underlying mechanisms on similar grounds for different layers and provides a direct means for applying explainability methods.
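A minimal sketch of the reformulation for a generic scalar gated linear recurrence (not the exact Mamba or RWKV parameterization): unrolling y_t = a_t y_{t-1} + v_t gives y_t = Σ_{s≤t} (Π_{r=s+1}^{t} a_r) v_s, i.e. the layer applies a lower-triangular, input-dependent matrix to the values, which plays the role of an implicit causal attention matrix.

```python
import numpy as np

def implicit_attention(gates: np.ndarray) -> np.ndarray:
    """Implicit causal attention matrix A for the scalar gated recurrence
    y_t = gates[t] * y_{t-1} + v_t, so that y = A @ v.
    A[t, s] = prod_{r=s+1..t} gates[r] for s <= t, else 0."""
    T = len(gates)
    A = np.zeros((T, T))
    for t in range(T):
        A[t, t] = 1.0            # empty product
        prod = 1.0
        for s in range(t - 1, -1, -1):
            prod *= gates[s + 1]
            A[t, s] = prod
    return A

rng = np.random.default_rng(0)
T = 5
gates = rng.uniform(0.5, 1.0, size=T)
v = rng.normal(size=T)

# Run the recurrence directly ...
y = np.zeros(T)
for t in range(T):
    y[t] = (gates[t] * y[t - 1] if t > 0 else 0.0) + v[t]

# ... and via the implicit attention matrix; the two agree.
assert np.allclose(y, implicit_attention(gates) @ v)
```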
arXiv Detail & Related papers (2024-05-26T09:57:45Z)
- A KDM-Based Approach for Architecture Conformance Checking in Adaptive Systems [0.3858593544497595]
We present REMEDY, a domain-specific approach that encompasses the specification of the planned adaptive architecture based on the MAPE-K reference model.
Our approach is specifically tailored for adaptive systems (ASs), incorporating well-known rules from the MAPE-K model.
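As a generic, hypothetical illustration of conformance checking against the MAPE-K reference model (the rule set and API below are invented for this sketch and are not REMEDY's actual specification language), one can compare the dependencies found in an implementation against those the planned architecture permits:

```python
# Hypothetical MAPE-K conformance check: components may only depend on the
# components the reference model allows (all components may use Knowledge).
ALLOWED = {
    "Monitor": {"Analyze", "Knowledge"},
    "Analyze": {"Plan", "Knowledge"},
    "Plan": {"Execute", "Knowledge"},
    "Execute": {"Knowledge"},
    "Knowledge": set(),
}

def check_conformance(actual_deps: dict[str, set[str]]) -> list[str]:
    """Return violations: dependencies present in the implementation
    but not permitted by the planned (MAPE-K-based) architecture."""
    violations = []
    for component, deps in actual_deps.items():
        for dep in deps - ALLOWED.get(component, set()):
            violations.append(f"{component} must not depend on {dep}")
    return violations

# Example: Monitor bypassing Analyze and calling Plan directly is flagged.
print(check_conformance({"Monitor": {"Plan", "Knowledge"}}))
# ['Monitor must not depend on Plan']
```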
arXiv Detail & Related papers (2024-01-29T18:22:11Z)
- HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization [69.33162366130887]
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features.
We introduce a novel method designed to supplement the model with domain-level and task-specific characteristics.
This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization.
arXiv Detail & Related papers (2024-01-18T04:23:21Z)
- Composing Task Knowledge with Modular Successor Feature Approximators [60.431769158952626]
We present a novel neural network architecture, "Modular Successor Feature Approximators" (MSFA).
MSFA is able to better generalize compared to baseline architectures for learning SFs and modular architectures.
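For context on the successor features (SFs) mentioned above, the standard successor-feature identity (a textbook relation, not MSFA's specific architecture) factorizes action values into reward-agnostic successor features ψ and a task-specific weight vector w, assuming rewards are linear in the features φ:

```latex
% Rewards are assumed linear in features: r(s, a, s') = \phi(s, a, s')^{\top} w.
\psi^{\pi}(s, a) \;=\; \mathbb{E}^{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}\, \phi(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s,\ a_0 = a\right],
\qquad
Q^{\pi}(s, a) \;=\; \psi^{\pi}(s, a)^{\top} w .
```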
arXiv Detail & Related papers (2023-01-28T23:04:07Z)
- Universal Information Extraction as Unified Semantic Matching [54.19974454019611]
We decouple information extraction into two abilities, structuring and conceptualizing, which are shared by different tasks and schemas.
Based on this paradigm, we propose to universally model various IE tasks with the Unified Semantic Matching (USM) framework.
In this way, USM can jointly encode schema and input text, uniformly extract substructures in parallel, and controllably decode target structures on demand.
arXiv Detail & Related papers (2023-01-09T11:51:31Z)
- Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer that divides the hidden representation and parameters into multiple mechanisms, which exchange information only through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
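A minimal sketch of the splitting idea (dimensions, the form of attention, and the omission of the competition step are simplifications of ours, not the paper's exact layer): the hidden state is partitioned across mechanisms with separate parameters, and the only path for information to cross mechanism boundaries is an attention step over the mechanism axis.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_mech = 6, 32, 4
d_mech = d_model // n_mech

# Separate parameters per mechanism for the "private" transformation.
W_mech = rng.normal(scale=0.1, size=(n_mech, d_mech, d_mech))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tim_like_layer(h):
    """h: (seq_len, d_model). The hidden state is split across mechanisms,
    each processed with its own weights; mechanisms then exchange information
    only through an attention step over the mechanism axis."""
    chunks = h.reshape(seq_len, n_mech, d_mech)
    # Independent per-mechanism computation (no parameter sharing).
    private = np.einsum("tmd,mde->tme", chunks, W_mech)
    # Inter-mechanism attention at each position: the only cross-mechanism path.
    scores = np.einsum("tmd,tnd->tmn", private, private) / np.sqrt(d_mech)
    attn = softmax(scores, axis=-1)                      # (T, n_mech, n_mech)
    exchanged = np.einsum("tmn,tnd->tmd", attn, private)
    return (private + exchanged).reshape(seq_len, d_model)

out = tim_like_layer(rng.normal(size=(seq_len, d_model)))
print(out.shape)  # (6, 32)
```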
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
- Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets.
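A schematic version of that joint objective (the losses, weighting, and test-time rule below are generic placeholders rather than the paper's exact formulation): the latent code is an embedding of the target y, trained so that it both reconstructs y and is predictable from the features x.

```latex
% z = E(y) is the target embedding; D decodes it back to y; P predicts it from x.
\min_{E, D, P}\;
\underbrace{\big\| y - D\!\left(E(y)\right) \big\|^{2}}_{\text{target reconstruction}}
\;+\;
\lambda\, \underbrace{\big\| E(y) - P(x) \big\|^{2}}_{\text{latent predictable from features}} ,
\qquad
\hat{y} = D\!\left(P(x)\right) \ \text{at test time.}
```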
arXiv Detail & Related papers (2020-01-23T02:37:10Z)