AttentionDrop: A Novel Regularization Method for Transformer Models
- URL: http://arxiv.org/abs/2504.12088v2
- Date: Fri, 19 Sep 2025 11:47:37 GMT
- Title: AttentionDrop: A Novel Regularization Method for Transformer Models
- Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Abdul Akbar Khan, Shahid Munir Shah, Muhammad Omer Khan
- Abstract summary: Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech processing. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. This research proposes a unified family of regularization techniques, which operate directly on the self-attention distributions.
- Score: 0.3262230127283452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech processing. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. This research proposes AttentionDrop, a unified family of stochastic regularization techniques with three variants, all of which operate directly on the self-attention distributions. Hard Attention Masking randomly zeroes out top-k attention logits per query to encourage diverse context utilization; Blurred Attention Smoothing applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions; and Consistency-Regularized AttentionDrop enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss. Results demonstrate that AttentionDrop consistently improves accuracy, calibration, and adversarial robustness over standard Dropout, DropConnect, and R-Drop baselines.
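Based only on the abstract, the three variants can be illustrated with a dependency-free Python sketch over a single query's row of attention logits. The function names, hyperparameters (k, drop probability p, Gaussian sigma and radius), and the choice to mask dropped logits to -inf are assumptions, not the paper's implementation:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one query's row of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) underflows to 0.0
    s = sum(exps)
    return [e / s for e in exps]

def hard_attention_masking(logits, k=2, p=0.5):
    """Hard Attention Masking (sketch): independently drop each of the
    top-k logits with probability p by masking it to -inf, so the softmax
    must spread mass over less dominant positions."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    out = list(logits)
    for i in top:
        if random.random() < p:
            out[i] = float("-inf")
    return out

def blurred_attention_smoothing(logits, sigma=1.0, radius=2):
    """Blurred Attention Smoothing (sketch): convolve a row of finite
    logits with a truncated Gaussian kernel to diffuse peaked distributions."""
    kernel = [math.exp(-(d * d) / (2 * sigma * sigma))
              for d in range(-radius, radius + 1)]
    n = len(logits)
    out = []
    for i in range(n):
        acc = wsum = 0.0
        for d in range(-radius, radius + 1):
            j = i + d
            if 0 <= j < n:
                w = kernel[d + radius]
                acc += w * logits[j]
                wsum += w
        out.append(acc / wsum)  # renormalize at the row boundaries
    return out

def consistency_loss(logits):
    """Consistency-Regularized AttentionDrop (sketch): two independent
    Hard-Attention-Masking perturbations of the same row, penalized by a
    symmetrized KL term so the attention distribution stays stable."""
    def kl(p, q, eps=1e-12):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    a = softmax(hard_attention_masking(logits))
    b = softmax(hard_attention_masking(logits))
    return 0.5 * (kl(a, b) + kl(b, a))
```

In the paper the perturbations act inside the attention layers of a full model and the KL term is added to the training loss; everything here is collapsed to a single logit row purely for illustration.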
Related papers
- Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention [14.827874140211328]
Transformer attention is typically implemented using softmax normalization, which constrains the attention weights to sum to one. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights.
arXiv Detail & Related papers (2026-02-26T14:42:16Z) - DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting [59.868414584142336]
DropoutTS is a model-agnostic plugin that shifts the paradigm from "what" to "how much" to learn. It maps noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity.
arXiv Detail & Related papers (2026-01-29T13:49:20Z) - From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers [0.0]
Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. We observe that existing models tend to allocate attention primarily to common words, overlooking less popular yet highly task-relevant terms. We propose an Adversarial Feedback for Attention (AFA) training mechanism that enables the model to automatically redistribute attention weights to appropriate focal points.
arXiv Detail & Related papers (2025-12-19T01:48:25Z) - Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models [22.573044825857043]
Dropout Prompt Learning applies dropout to improve the robustness of vision-language models. Our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% on base-to-novel generalization.
arXiv Detail & Related papers (2025-12-08T07:31:27Z) - $\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion [65.77755100137728]
We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
arXiv Detail & Related papers (2025-11-26T16:14:20Z) - Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance [8.46069844016289]
Adversarial Sinkhorn Attention Guidance (ASAG) is a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet.
arXiv Detail & Related papers (2025-11-10T15:52:53Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. We also introduce SLIM, the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Control and Realism: Best of Both Worlds in Layout-to-Image without Training [59.16447569868382]
We present WinWinLay, a training-free method for layout-to-image generation. We propose two key strategies, Non-local Attention Energy and Adaptive Update, that collaboratively enhance control precision and realism. WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods.
arXiv Detail & Related papers (2025-06-18T15:39:02Z) - Relevance-driven Input Dropout: an Explanation-guided Regularization Technique [10.97680893924652]
Overfitting is a well-known issue extending even to state-of-the-art (SOTA) Machine Learning (ML) models. Mitigation measures include a combination of dropout, data augmentation, weight decay, and other regularization techniques. We propose Relevance-driven Input Dropout (RelDrop), a novel data augmentation method which selectively occludes the most relevant regions of the input.
arXiv Detail & Related papers (2025-05-27T16:52:29Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose MoMADiff, a robust motion generation framework for generating 3D human motion from text descriptions. Our model supports flexible user-provided specification, enabling precise control over both spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
arXiv Detail & Related papers (2025-05-16T09:06:15Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Focus What Matters: Matchability-Based Reweighting for Local Feature Matching [6.361840891399624]
We propose a novel attention reweighting mechanism that simultaneously incorporates a learnable bias term into the attention logits. Experiments conducted on three benchmark datasets validate the effectiveness of our method.
arXiv Detail & Related papers (2025-05-04T15:50:28Z) - A Language Anchor-Guided Method for Robust Noisy Domain Generalization [20.83580289888522]
We introduce Anchor Alignment and Adaptive Weighting (A3W). A3W uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features. It consistently outperforms state-of-the-art domain generalization methods.
arXiv Detail & Related papers (2025-03-21T15:20:28Z) - Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift [51.24522135151649]
Anomaly detection plays a crucial role in quality control for industrial applications.<n>Existing methods attempt to address domain shifts by training generalizable models.<n>Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
arXiv Detail & Related papers (2025-03-19T05:25:52Z) - Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise.
arXiv Detail & Related papers (2025-03-06T14:30:55Z) - Breaking the Bias: Recalibrating the Attention of Industrial Anomaly Detection [20.651257973799527]
Recalibrating Attention of Industrial Anomaly Detection (RAAD) is a framework that systematically decomposes and recalibrates attention maps.
HQS dynamically adjusts bit-widths based on the hierarchical nature of attention maps.
We validate the effectiveness of RAAD on 32 datasets using a single 3090ti.
arXiv Detail & Related papers (2024-12-11T08:31:47Z) - Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images.
Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding.
We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties.
Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z) - Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study [38.492552119793]
We investigate an alternative attention mechanism based on the stick-breaking process in larger-scale settings. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods.
arXiv Detail & Related papers (2024-10-23T15:51:13Z) - Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
Self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to the model performances.
We show that a small eigenspectrum variance leads attention to be localized.
arXiv Detail & Related papers (2024-02-03T09:35:53Z) - PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - WeaNF: Weak Supervision with Normalizing Flows [4.446580498787894]
Weak supervision introduces problems of noisy labels, coverage and bias.
We generatively model the input-side data distributions covered by labeling functions.
We analyze the effectiveness and modeling capabilities on various commonly used weak supervision data sets.
arXiv Detail & Related papers (2022-04-28T10:59:54Z) - Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Scheduled DropHead: A Regularization Method for Transformer Models [111.18614166615968]
DropHead is a structured dropout method specifically designed for regularizing the multi-head attention mechanism.
It drops entire attention-heads during training.
It prevents the multi-head attention model from being dominated by a small portion of attention heads.
arXiv Detail & Related papers (2020-04-28T07:33:14Z)
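The DropHead idea summarized above (drop entire attention heads during training) lends itself to a short, framework-free sketch. The inverted-dropout rescaling of surviving heads and the guard against dropping every head are our assumptions, and the scheduled drop rate from the paper's title is omitted here:

```python
import random

def drophead_mask(num_heads, p_drop, training=True):
    """DropHead (sketch): sample a binary keep-mask over attention heads and
    rescale the survivors so the expected output magnitude is unchanged
    (inverted-dropout convention; the exact scaling is an assumption)."""
    if not training or p_drop == 0.0:
        return [1.0] * num_heads
    keep = [1.0 if random.random() >= p_drop else 0.0 for _ in range(num_heads)]
    if not any(keep):  # never drop every head: resample one survivor
        keep[random.randrange(num_heads)] = 1.0
    scale = num_heads / sum(keep)
    return [k * scale for k in keep]

def apply_drophead(head_outputs, p_drop=0.2):
    """Scale each head's output vector by its mask entry before the heads
    are concatenated into the attention-layer output."""
    mask = drophead_mask(len(head_outputs), p_drop)
    return [[m * x for x in h] for m, h in zip(mask, head_outputs)]
```

Because whole heads are zeroed rather than individual weights, no single head can dominate: the model is forced to distribute useful behavior across the surviving heads on every training step.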
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.