AttentionDrop: A Novel Regularization Method for Transformer Models
- URL: http://arxiv.org/abs/2504.12088v2
- Date: Fri, 19 Sep 2025 11:47:37 GMT
- Title: AttentionDrop: A Novel Regularization Method for Transformer Models
- Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Abdul Akbar Khan, Shahid Munir Shah, Muhammad Omer Khan
- Abstract summary: Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech processing. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. This research proposes a unified family of regularization techniques, which operate directly on the self-attention distributions.
- Score: 0.3262230127283452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech processing. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. This research proposes AttentionDrop, a unified family of stochastic regularization techniques with three variants, all of which operate directly on the self-attention distributions. Hard Attention Masking randomly zeroes out top-k attention logits per query to encourage diverse context utilization; Blurred Attention Smoothing applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions; and Consistency-Regularized AttentionDrop enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss. Results demonstrate that AttentionDrop consistently improves accuracy, calibration, and adversarial robustness over standard Dropout, DropConnect, and R-Drop baselines.
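Based only on the abstract, the three variants can be illustrated with a dependency-free Python sketch over a single query's row of attention logits. The function names, hyperparameters (k, drop probability p, Gaussian sigma and radius), and the choice to mask dropped logits to -inf are assumptions, not the paper's implementation:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one query's row of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) underflows to 0.0
    s = sum(exps)
    return [e / s for e in exps]

def hard_attention_masking(logits, k=2, p=0.5):
    """Hard Attention Masking (sketch): independently drop each of the
    top-k logits with probability p by masking it to -inf, so the softmax
    must spread mass over less dominant positions."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    out = list(logits)
    for i in top:
        if random.random() < p:
            out[i] = float("-inf")
    return out

def blurred_attention_smoothing(logits, sigma=1.0, radius=2):
    """Blurred Attention Smoothing (sketch): convolve a row of finite
    logits with a truncated Gaussian kernel to diffuse peaked distributions."""
    kernel = [math.exp(-(d * d) / (2 * sigma * sigma))
              for d in range(-radius, radius + 1)]
    n = len(logits)
    out = []
    for i in range(n):
        acc = wsum = 0.0
        for d in range(-radius, radius + 1):
            j = i + d
            if 0 <= j < n:
                w = kernel[d + radius]
                acc += w * logits[j]
                wsum += w
        out.append(acc / wsum)  # renormalize at the row boundaries
    return out

def consistency_loss(logits):
    """Consistency-Regularized AttentionDrop (sketch): two independent
    Hard-Attention-Masking perturbations of the same row, penalized by a
    symmetrized KL term so the attention distribution stays stable."""
    def kl(p, q, eps=1e-12):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    a = softmax(hard_attention_masking(logits))
    b = softmax(hard_attention_masking(logits))
    return 0.5 * (kl(a, b) + kl(b, a))
```

In the paper the perturbations act inside the attention layers of a full model and the KL term is added to the training loss; everything here is collapsed to a single logit row purely for illustration.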
Related papers
- Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention [14.827874140211328]
Transformer attention is typically implemented using softmax normalization, which constrains the attention weights to sum to one. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights.
arXiv Detail & Related papers (2026-02-26T14:42:16Z) - DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting [59.868414584142336]
DropoutTS is a model-agnostic plugin that shifts the paradigm from "what" to "how much" to learn. It maps noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity.
arXiv Detail & Related papers (2026-01-29T13:49:20Z) - From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers [0.0]
Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. We observe that existing models tend to allocate attention primarily to common words, overlooking less popular yet highly task-relevant terms. We propose an Adversarial Feedback for Attention (AFA) training mechanism that enables the model to automatically redistribute attention weights to appropriate focal points.
arXiv Detail & Related papers (2025-12-19T01:48:25Z) - Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models [22.573044825857043]
Dropout Prompt Learning applies dropout to improve the robustness of vision-language models. Our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% on base-to-novel generalization.
arXiv Detail & Related papers (2025-12-08T07:31:27Z) - $\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion [65.77755100137728]
We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
arXiv Detail & Related papers (2025-11-26T16:14:20Z) - Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance [8.46069844016289]
Adversarial Sinkhorn Attention Guidance (ASAG) is a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet.
arXiv Detail & Related papers (2025-11-10T15:52:53Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. We also introduce SLIM, the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Control and Realism: Best of Both Worlds in Layout-to-Image without Training [59.16447569868382]
We present WinWinLay, a training-free method for layout-to-image generation. We propose two key strategies, Non-local Attention Energy and Adaptive Update, that collaboratively enhance control precision and realism. WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods.
arXiv Detail & Related papers (2025-06-18T15:39:02Z) - Relevance-driven Input Dropout: an Explanation-guided Regularization Technique [10.97680893924652]
Overfitting is a well-known issue extending even to state-of-the-art (SOTA) Machine Learning (ML) models. Mitigation measures include a combination of dropout, data augmentation, weight decay, and other regularization techniques. We propose Relevance-driven Input Dropout (RelDrop), a novel data augmentation method which selectively occludes the most relevant regions of the input.
arXiv Detail & Related papers (2025-05-27T16:52:29Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose MoMADiff, a robust motion generation framework for generating 3D human motion from text descriptions. Our model supports flexible user-provided specification, enabling precise control over both spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
arXiv Detail & Related papers (2025-05-16T09:06:15Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Focus What Matters: Matchability-Based Reweighting for Local Feature Matching [6.361840891399624]
We propose a novel attention reweighting mechanism that simultaneously incorporates a learnable bias term into the attention logits. Experiments conducted on three benchmark datasets validate the effectiveness of our method.
arXiv Detail & Related papers (2025-05-04T15:50:28Z) - A Language Anchor-Guided Method for Robust Noisy Domain Generalization [20.83580289888522]
We introduce Anchor Alignment and Adaptive Weighting (A3W). A3W uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features. It consistently outperforms state-of-the-art domain generalization methods.
arXiv Detail & Related papers (2025-03-21T15:20:28Z) - Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift [51.24522135151649]
Anomaly detection plays a crucial role in quality control for industrial applications.<n>Existing methods attempt to address domain shifts by training generalizable models.<n>Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
arXiv Detail & Related papers (2025-03-19T05:25:52Z) - Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise.
arXiv Detail & Related papers (2025-03-06T14:30:55Z) - Breaking the Bias: Recalibrating the Attention of Industrial Anomaly Detection [20.651257973799527]
Recalibrating Attention of Industrial Anomaly Detection (RAAD) is a framework that systematically decomposes and recalibrates attention maps.
HQS dynamically adjusts bit-widths based on the hierarchical nature of attention maps.
We validate the effectiveness of RAAD on 32 datasets using a single 3090ti.
arXiv Detail & Related papers (2024-12-11T08:31:47Z) - Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images.
Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding.
We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties.
Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z) - Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study [38.492552119793]
We investigate an alternative attention mechanism based on the stick-breaking process in larger-scale settings. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods.
arXiv Detail & Related papers (2024-10-23T15:51:13Z) - Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
Self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to the model performances.
We show that a small eigenspectrum variance leads attention to be localized.
arXiv Detail & Related papers (2024-02-03T09:35:53Z) - PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - WeaNF: Weak Supervision with Normalizing Flows [4.446580498787894]
Weak supervision introduces problems of noisy labels, coverage and bias.
We generatively model the input-side data distributions covered by labeling functions.
We analyze the effectiveness and modeling capabilities on various commonly used weak supervision data sets.
arXiv Detail & Related papers (2022-04-28T10:59:54Z) - Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Scheduled DropHead: A Regularization Method for Transformer Models [111.18614166615968]
DropHead is a structured dropout method specifically designed for regularizing the multi-head attention mechanism.
It drops entire attention-heads during training.
It prevents the multi-head attention model from being dominated by a small portion of attention heads.
arXiv Detail & Related papers (2020-04-28T07:33:14Z)
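The DropHead idea summarized above (drop entire attention heads during training) lends itself to a short, framework-free sketch. The inverted-dropout rescaling of surviving heads and the guard against dropping every head are our assumptions, and the scheduled drop rate from the paper's title is omitted here:

```python
import random

def drophead_mask(num_heads, p_drop, training=True):
    """DropHead (sketch): sample a binary keep-mask over attention heads and
    rescale the survivors so the expected output magnitude is unchanged
    (inverted-dropout convention; the exact scaling is an assumption)."""
    if not training or p_drop == 0.0:
        return [1.0] * num_heads
    keep = [1.0 if random.random() >= p_drop else 0.0 for _ in range(num_heads)]
    if not any(keep):  # never drop every head: resample one survivor
        keep[random.randrange(num_heads)] = 1.0
    scale = num_heads / sum(keep)
    return [k * scale for k in keep]

def apply_drophead(head_outputs, p_drop=0.2):
    """Scale each head's output vector by its mask entry before the heads
    are concatenated into the attention-layer output."""
    mask = drophead_mask(len(head_outputs), p_drop)
    return [[m * x for x in h] for m, h in zip(mask, head_outputs)]
```

Because whole heads are zeroed rather than individual weights, no single head can dominate: the model is forced to distribute useful behavior across the surviving heads on every training step.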
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.