Attention Projection Mixing with Exogenous Anchors
- URL: http://arxiv.org/abs/2601.08131v2
- Date: Thu, 22 Jan 2026 12:45:06 GMT
- Title: Attention Projection Mixing with Exogenous Anchors
- Authors: Jonathan Su
- Abstract summary: Cross-layer reuse of early attention projections can improve data efficiency, but it creates a structural conflict. We show this conflict is a hidden limiter of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning anchor projections outside the sequential layer stack.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We show this "first-layer tension" is a hidden limiter of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients at elementwise, headwise, or scalar granularity, and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant gains 1.5 downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in refinement. We release code and models to facilitate future research.
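The abstract's unified normalized mixing framework can be sketched as follows. This is a minimal illustration, not the released implementation: the sigmoid-gated convex combination, the RMS normalization of the anchor, and all names (`normalized_anchor_mix`, `alpha`) are assumptions; only the idea of mixing a layer's own projection with a normalized exogenous anchor at scalar, headwise, or elementwise granularity comes from the abstract.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalized_anchor_mix(own, anchor, alpha, granularity="scalar", num_heads=4):
    """Mix a layer's own projection with a normalized exogenous anchor.

    own, anchor: (seq_len, d_model) arrays, e.g. query or key projections.
    alpha: learnable mixing logit -- a scalar, a (num_heads,) vector, or a
    (d_model,) vector, matching `granularity`.
    """
    # Normalize the anchor source (RMS over the feature dim); the abstract
    # reports that normalizing anchor sources is key to stable reuse.
    rms = np.sqrt(np.mean(anchor ** 2, axis=-1, keepdims=True) + 1e-6)
    anchor = anchor / rms
    g = _sigmoid(np.asarray(alpha, dtype=own.dtype))
    if granularity == "headwise":
        head_dim = own.shape[-1] // num_heads
        g = np.repeat(g, head_dim)  # (num_heads,) -> (d_model,)
    # Scalar g broadcasts as-is; elementwise g is already (d_model,).
    return (1.0 - g) * own + g * anchor
```

The same routine would apply to each mixed stream (queries, keys, values, gate logits), with a separate coefficient per stream.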
Related papers
- HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference [11.718567830546538]
Long-context inference in Large Language Models is bottlenecked by the quadratic computational complexity of attention. We introduce HyLRA, a novel framework driven by layer-wise sparsity profiling. We show that HyLRA improves inference throughput by 6%--46% while maintaining comparable performance.
arXiv Detail & Related papers (2026-01-31T15:36:17Z) - A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training [86.64715217940274]
Outliers function jointly with normalization. They serve as rescaling factors rather than as direct contributors, and can be absorbed into learnable parameters or mitigated via explicit gated rescaling.
arXiv Detail & Related papers (2026-01-30T13:29:45Z) - STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models [11.965535230928372]
STORE is a unified and scalable token-based ranking framework built upon three core innovations. The framework consistently improves prediction accuracy (online CTR by 2.71%, AUC by 1.195%) and training efficiency (1.84x throughput).
arXiv Detail & Related papers (2025-11-24T06:20:02Z) - Deconstructing Attention: Investigating Design Principles for Effective Language Modeling [37.92951508140559]
The success of Transformer language models is widely credited to their dot-product attention mechanism. This work systematically deconstructs attention by designing controlled variants that relax its underlying design principles. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention.
arXiv Detail & Related papers (2025-10-13T16:42:14Z) - FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning [1.912429179274357]
We present FLUID (Flow-Latent Unified Integration via Token Distillation) for expert specialization in multimodal learning. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time.
arXiv Detail & Related papers (2025-08-10T09:34:17Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition [9.83509397800422]
We propose Fraesormer, an adaptive and efficient sparse Transformer architecture with two core designs. ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores. HSSFGN employs a gating mechanism to achieve multi-scale feature representation.
arXiv Detail & Related papers (2025-03-15T05:13:26Z) - Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers. We propose Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of the associated NLM smoothing.
arXiv Detail & Related papers (2025-03-02T01:56:35Z) - Disentangled Interleaving Variational Encoding [1.132458063021286]
We propose a principled approach to disentangle the original input into marginal and conditional probability distributions in the latent space of a variational autoencoder. Our proposed model, the Deep Disentangled Interleaving Variational Encoder (DeepDIVE), learns disentangled features from the original input to form clusters in the embedding space. Experiments on two public datasets show that DeepDIVE disentangles the original input and yields better forecast accuracy than the original VAE.
arXiv Detail & Related papers (2025-01-15T10:50:54Z) - Continuous Knowledge-Preserving Decomposition with Adaptive Layer Selection for Few-Shot Class-Incremental Learning [73.59672160329296]
CKPD-FSCIL is a unified framework that unlocks the underutilized capacity of pretrained weights. Our method consistently outperforms state-of-the-art approaches in both adaptability and knowledge retention.
arXiv Detail & Related papers (2025-01-09T07:18:48Z) - Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models. A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
arXiv Detail & Related papers (2024-10-03T15:45:15Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
arXiv Detail & Related papers (2024-06-09T05:57:40Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor in the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective, jointly with feature reconstruction, to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z) - Defensive Tensorization [113.96183766922393]
We propose Defensive Tensorization, an adversarial defence technique that leverages a latent high-order factorization of the network.
We empirically demonstrate the effectiveness of our approach on standard image classification benchmarks.
We validate the versatility of our approach across domains and low-precision architectures by considering an audio task and binary networks.
arXiv Detail & Related papers (2021-10-26T17:00:16Z) - Polarized Self-Attention: Towards High-quality Pixel-wise Regression [19.2303932008785]
This paper presents the Polarized Self-Attention (PSA) block, which incorporates two critical designs towards high-quality pixel-wise regression.
Experimental results show that PSA boosts standard baselines by 2-4 points and state-of-the-art methods by 1-2 points on 2D pose estimation and semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-07-02T01:03:11Z)
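The head-specific sigmoid gate after SDPA described in the Gated Attention entry above (and used as the baseline for ExoFormer) can be sketched as follows. This is a minimal sketch assuming a single learnable logit per head; the actual paper's gate may instead be computed from the hidden state, and the name `gated_sdpa` is an invention for illustration.

```python
import numpy as np

def _softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_sdpa(q, k, v, gate_logits):
    """Scaled dot-product attention followed by a head-specific sigmoid gate.

    q, k, v: (num_heads, seq_len, head_dim); gate_logits: (num_heads,).
    """
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = _softmax(scores) @ v                 # standard SDPA per head
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # one sigmoid gate per head
    return out * gate[:, None, None]           # rescale each head's output
```

Because the gate is applied after the softmax-weighted sum, a near-zero gate lets a head contribute nothing, which is one way such designs avoid attention sinks.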
This list is automatically generated from the titles and abstracts of the papers in this site.