Related papers: On the Optimization and Generalization of Multi-head Attention

Related papers

A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning [51.505728136705564]
We develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks.<n>We find that different initialization choices place the network into four distinct fine-tuning regimes.<n>A smaller scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization.
arXiv Detail & Related papers (2026-02-23T17:19:33Z)
Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization [21.788680301776207]
We propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization.<n>We first adopt a dense self-training objective as the primary source of structural self-supervision.<n>To further improve robustness, we introduce semantically-aware hierarchical normalization.
arXiv Detail & Related papers (2025-11-18T08:16:16Z)
Continual Learning with Query-Only Attention [0.0]
Continual learning involves learning from a stream of data without repetition of data points.<n>We propose a query-only attention mechanism that discards keys and values, yet preserves the core inductive bias of transformer architectures.
arXiv Detail & Related papers (2025-10-01T00:14:34Z)
Why Generate When You Can Transform? Unleashing Generative Attention for Dynamic Recommendation [9.365893765448366]
Sequential Recommendation (SR) focuses on personalizing user experiences by predicting future preferences based on historical interactions.<n> Transformer models, with their attention mechanisms, have become the dominant architecture in SR tasks.<n>We introduce two generative attention models for SR, each grounded in the principles of Variational Autoencoders (VAE) and Diffusion Models (DMs)
arXiv Detail & Related papers (2025-08-04T04:33:26Z)
Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets.<n>We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms.<n>Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
arXiv Detail & Related papers (2025-07-12T08:10:10Z)
Tuning for Trustworthiness -- Balancing Performance and Explanation Consistency in Neural Network Optimization [49.567092222782435]
We introduce the novel concept of XAI consistency, defined as the agreement among different feature attribution methods.<n>We create a multi-objective optimization framework that balances predictive performance with explanation.<n>Our research provides a foundation for future investigations into whether models from the trade-off zone-balancing performance loss and XAI consistency-exhibit greater robustness.
arXiv Detail & Related papers (2025-05-12T13:19:14Z)
Partial Transportability for Domain Generalization [56.37032680901525]
Building on the theory of partial identification and transportability, this paper introduces new results for bounding the value of a functional of the target distribution. Our contribution is to provide the first general estimation technique for transportability problems. We propose a gradient-based optimization scheme for making scalable inferences in practice.
arXiv Detail & Related papers (2025-03-30T22:06:37Z)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [86.21199607040147]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation language models. We introduce Chain-of-Description, a step-by-step visual understanding method, and integrate structured chain-of-thought (CoT) reasoning to support in-depth multimodal reasoning. Extensive experiments demonstrate that SIcog produces next-generation foundation MLLMs with substantially improved multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
Single Domain Generalization with Model-aware Parametric Batch-wise Mixup [22.709796153794507]
Single Domain Generalization remains a formidable challenge in the field of machine learning. We propose a novel data augmentation approach, named as Model-aware Parametric Batch-wise Mixup. By exploiting inter-feature correlations, the parameterized mixup generator introduces additional versatility in combining features across a batch of instances.
arXiv Detail & Related papers (2025-02-22T03:45:18Z)
Multi-fidelity Machine Learning for Uncertainty Quantification and Optimization [4.557963624437784]
Multi-fidelity methods integrate high- and low-fidelity models to balance computational cost and predictive accuracy. This perspective paper provides an in-depth overview of the emerging field of machine learning-based multi-fidelity methods.
arXiv Detail & Related papers (2024-10-30T22:22:07Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit the CLIP and CoOp with our method to effectively improve the model on few-shot image classficiation scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
Morelinear model dot-product attention is studied as a non-dimensional self-attention layer with trainable and low-dimensional query and key data. We show that either a positional attention mechanism (with tokens each other based on their respective positions) or a semantic attention mechanism (with tokens tied to each other based their meaning) or a transition from the former to the latter with increasing sample complexity.
arXiv Detail & Related papers (2024-02-06T11:13:54Z)
A Bayesian Unification of Self-Supervised Clustering and Energy-Based Models [11.007541337967027]
We perform a Bayesian analysis of state-of-the-art self-supervised learning objectives. We show that our objective function allows to outperform existing self-supervised learning strategies. We also demonstrate that GEDI can be integrated into a neuro-symbolic framework.
arXiv Detail & Related papers (2023-12-30T04:46:16Z)
Dance of Channel and Sequence: An Efficient Attention-Based Approach for Multivariate Time Series Forecasting [3.372816393214188]
CSformer is an innovative framework characterized by a meticulously engineered two-stage self-attention mechanism. We introduce sequence adapters and channel adapters, ensuring the model's ability to discern salient features across various dimensions.
arXiv Detail & Related papers (2023-12-11T09:10:38Z)
End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures. We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
Convexifying Transformers: Improving optimization and understanding of transformer networks [56.69983975369641]
We study the training problem of attention/transformer networks and introduce a novel convex analytic approach. We first introduce a convex alternative to the self-attention mechanism and reformulate the regularized training problem of transformer networks. As a byproduct of our convex analysis, we reveal an implicit regularization mechanism, which promotes sparsity across tokens.
arXiv Detail & Related papers (2022-11-20T18:17:47Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
Optimization-Inspired Learning with Architecture Augmentations and Control Mechanisms for Low-Level Vision [74.9260745577362]
This paper proposes a unified optimization-inspired learning framework to aggregate Generative, Discriminative, and Corrective (GDC) principles. We construct three propagative modules to effectively solve the optimization models with flexible combinations. Experiments across varied low-level vision tasks validate the efficacy and adaptability of GDC.
arXiv Detail & Related papers (2020-12-10T03:24:53Z)
Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional. We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets.
arXiv Detail & Related papers (2020-01-23T02:37:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.