Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
- URL: http://arxiv.org/abs/2511.00797v1
- Date: Sun, 02 Nov 2025 04:32:41 GMT
- Title: Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation
- Authors: Wang Zixian
- Abstract summary: Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis. We propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics -- attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis -- to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes under under-trained and over-trained source regimes reveal that over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.
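The saturation mechanism formalized in the abstract can be illustrated directly: for cross-entropy loss on a softmax output, the gradient with respect to the logits is p - y, so a saturated (low-entropy) distribution yields a vanishing backward signal. A minimal, self-contained sketch, not the authors' code; the entropy of a softmax output stands in here for the paper's attention-entropy saturation proxy:

```python
# Sketch of the saturation -> gradient-suppression mechanism: for
# cross-entropy loss on softmax outputs, dL/dz = p - y, so as the
# predicted probability of the true class approaches 1, both the
# output entropy and the logit-gradient norm collapse toward zero.
import math

def softmax(z):
    m = max(z)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    # Shannon entropy in nats; low entropy = peaked/saturated distribution,
    # the same intuition behind the paper's attention-entropy proxy.
    return -sum(q * math.log(q) for q in p if q > 0)

def logit_grad_norm(z, true_idx):
    # Gradient of cross-entropy w.r.t. the logits is p - y (one-hot y).
    p = softmax(z)
    g = [pi - (1.0 if i == true_idx else 0.0) for i, pi in enumerate(p)]
    return math.sqrt(sum(gi * gi for gi in g))

# As the true-class logit grows (over-confidence), entropy and the
# backward signal both shrink toward zero.
for scale in (0.0, 2.0, 8.0):
    z = [scale, 0.0, 0.0]            # class 0 is the true class
    print(f"scale={scale}: H={entropy(softmax(z)):.4f}, "
          f"|dL/dz|={logit_grad_norm(z, 0):.4f}")
```

At scale 0 the distribution is uniform (maximum entropy, large gradient); by scale 8 the gradient norm is below 1e-3, which is the suppression the paper diagnoses at inflection layers.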
Related papers
- Robust Depth Super-Resolution via Adaptive Diffusion Sampling [32.09035309959689]
AdaDS robustly recovers high-resolution depth maps from arbitrarily degraded inputs. AdaDS capitalizes on the contraction property of Gaussian smoothing. Experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization.
arXiv Detail & Related papers (2026-02-10T08:10:02Z)
- Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models [19.448467763421707]
Large language models (LLMs) continue to grow, making parameter-efficient fine-tuning the default strategy for downstream adaptation. Current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model.
arXiv Detail & Related papers (2026-02-03T21:05:55Z)
- The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy [30.776360295485762]
Implicit Neural Representations (INRs) have revolutionized continuous signal modeling, yet they struggle to recover fine-grained details within finite training budgets. We introduce a structural diagnostic framework to identify the "Inlet Rank Collapse", a phenomenon where the low-dimensional input coordinates fail to span the high-dimensional embedding space. We derive a Rank-Expanding Initialization, a minimalist remedy that ensures the representation rank scales with the layer width without architectural modifications or computational overhead.
arXiv Detail & Related papers (2026-02-02T01:38:19Z)
- SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers [16.976750197698063]
We introduce SPINAL, a diagnostic that measures how alignment reshapes representations across depth. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass.
arXiv Detail & Related papers (2026-01-08T17:47:12Z)
- Uncertainty-Guided Selective Adaptation Enables Cross-Platform Predictive Fluorescence Microscopy [65.15943255667733]
We introduce Subnetwork Image Translation ADDA with automatic depth selection (SIT-ADDA-Auto). We show that adapting only the earliest convolutional layers, while freezing deeper layers, yields reliable transfer. Our results provide a design rule for label-free adaptation in microscopy and a recipe for field settings; the code is publicly available.
arXiv Detail & Related papers (2025-11-15T03:01:05Z)
- Generative Model Inversion Through the Lens of the Manifold Hypothesis [98.37040155914595]
Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process.
arXiv Detail & Related papers (2025-09-24T14:39:25Z)
- CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects [2.321156185872456]
We propose E-FPN-BS, a novel architecture integrating multi-scale feature enhancement and adaptive optimization. First, our Context Enhancement Module (CEM) employs dual-branch processing to align and compress high-level features for effective global-local fusion. Second, the Foreground-Background Separation Module (FBSM) generates spatial gating masks that dynamically amplify discriminative regions.
arXiv Detail & Related papers (2025-06-11T16:13:38Z)
- GRILL: Gradient Signal Restoration in Ill-Conditioned Layers to Enhance Adversarial Attacks on Autoencoders [4.046100165562807]
We introduce GRILL, a technique that restores gradient signals in ill-conditioned layers, enabling more effective norm-bounded attacks. We show that our method significantly increases the effectiveness of our adversarial attacks, enabling a more rigorous evaluation of AE robustness.
arXiv Detail & Related papers (2025-05-06T15:52:14Z)
- Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
- Mjolnir: Breaking the Shield of Perturbation-Protected Gradients via Adaptive Diffusion [13.764770382623812]
We present the first attempt to break the shield of gradient perturbation protection in Federated Learning. We introduce Mjolnir, a perturbation-resilient gradient leakage attack. Mjolnir is capable of removing perturbations from gradients without requiring additional access to the original model structure or external data.
arXiv Detail & Related papers (2024-07-07T07:06:49Z)
- Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters [69.24377241408851]
Overfitting to the source domain is a common issue in gradient-based training of deep neural networks.
We propose to base the selection on the gradient-signal-to-noise ratio (GSNR) of the network's parameters.
arXiv Detail & Related papers (2023-10-11T10:21:34Z)
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z)
- GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z)
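Returning to the headline paper's diagnose-first, inject-light strategy: the core move is to wrap only the layers flagged by the diagnostics with low-rank adapters and leave everything else frozen. A minimal sketch under that reading; the class and function names are hypothetical illustrations, not the authors' implementation:

```python
# Hedged sketch (hypothetical names, not the authors' code): "inject-light"
# fine-tuning wraps only diagnosed inflection layers with a LoRA-style
# low-rank update y = Wx + (alpha/r) * B(Ax), keeping W frozen.

class LoRALinear:
    def __init__(self, W, r=2, alpha=4.0):
        self.W = W                                   # frozen pretrained weight, d_out x d_in
        d_out, d_in = len(W), len(W[0])
        self.A = [[0.0] * d_in for _ in range(r)]    # trainable down-projection
        self.B = [[0.0] * r for _ in range(d_out)]   # trainable up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        # Base path: frozen Wx; adapter path: scaled B(Ax) added on top.
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        Ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        delta = [sum(b * ai for b, ai in zip(row, Ax)) for row in self.B]
        return [b + self.scale * d for b, d in zip(base, delta)]

def inject_lora(layers, inflection_idx, r=2):
    """Wrap only the layers flagged by the diagnostics; leave the rest frozen."""
    return [LoRALinear(W, r) if i in inflection_idx else W
            for i, W in enumerate(layers)]
```

Because B is zero-initialized, injection is behavior-preserving at step 0; training then updates only A and B at the selected layers, which is the "minimal parameter overhead" the abstract claims.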
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.