Spectral Insights into Data-Oblivious Critical Layers in Large Language Models
- URL: http://arxiv.org/abs/2506.00382v2
- Date: Wed, 04 Jun 2025 18:25:14 GMT
- Title: Spectral Insights into Data-Oblivious Critical Layers in Large Language Models
- Authors: Xuyuan Liu, Lei Hsiung, Yaoqing Yang, Yujun Yan
- Abstract summary: We introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned language models. We show that layers with significant shifts in representation space are also those most affected during fine-tuning.
- Score: 7.486925126518052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding how feature representations evolve across layers in large language models (LLMs) is key to improving their interpretability and robustness. While recent studies have identified critical layers linked to specific functions or behaviors, these efforts typically rely on data-dependent analyses of fine-tuned models, limiting their use to post-hoc settings. In contrast, we introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned LLMs by analyzing representation dynamics via Centered Kernel Alignment (CKA). We show that layers with significant shifts in representation space are also those most affected during fine-tuning, a pattern that holds consistently across tasks for a given model. Our spectral analysis further reveals that these shifts are driven by changes in the top principal components, which encode semantic transitions from rationales to conclusions. We further apply these findings to two practical scenarios: efficient domain adaptation, where fine-tuning critical layers leads to greater loss reduction compared to non-critical layers; and backdoor defense, where freezing them reduces attack success rates by up to 40%.
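The following is a minimal sketch, not the authors' released code, of the kind of CKA-based analysis the abstract describes: compute linear CKA between hidden states of consecutive layers of a pre-fine-tuned model and flag the layers whose representations shift the most. The model name ("gpt2"), the probe text, and the choice of reporting the top three layers are illustrative assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two (n_tokens, hidden_dim) representation matrices."""
    X = X - X.mean(dim=0, keepdim=True)  # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm(p="fro") ** 2
    norm_x = (X.T @ X).norm(p="fro")
    norm_y = (Y.T @ Y).norm(p="fro")
    return (hsic / (norm_x * norm_y)).item()

# Stand-in model; the analysis itself does not depend on any task-specific data.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

text = "The quick brown fox jumps over the lazy dog."  # illustrative probe text
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states  # tuple of (1, seq_len, hidden_dim), one per layer

# CKA between consecutive layers; low similarity means a large representational shift.
shifts = [1.0 - linear_cka(hidden[i][0], hidden[i + 1][0]) for i in range(len(hidden) - 1)]

# Layers with the largest shifts are candidates for the paper's "critical layers".
critical = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)[:3]
print("Largest representation shifts occur after layers:", critical)

The paper's spectral analysis and its fine-tuning/backdoor-defense experiments go beyond this sketch; the snippet only illustrates how layer-to-layer CKA can surface candidate critical layers without task data.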
Related papers
- Holes in Latent Space: Topological Signatures Under Adversarial Influence [1.193044160835091]
We propose persistent homology (PH), a tool from topological data analysis, to characterize multiscale latent space dynamics in language models. We show that adversarial conditions consistently compress latent topologies, reducing structural diversity at smaller scales while amplifying dominant features at coarser ones. We introduce a neuron-level PH framework that quantifies how information flows and transforms within and across layers.
arXiv Detail & Related papers (2025-05-26T18:31:49Z) - Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards [13.197807179926428]
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. In this work, we investigate Accidental Vulnerability: unexpected vulnerabilities arising from characteristics of fine-tuning data.
arXiv Detail & Related papers (2025-05-22T15:30:00Z) - Mechanistic Interpretability of GPT-like Models on Summarization Tasks [2.4022340214033915]
This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture.
arXiv Detail & Related papers (2025-05-20T02:15:11Z) - Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - Layer by Layer: Uncovering Hidden Representations in Language Models [28.304269706993942]
We show that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. Our framework highlights how each model layer balances information compression and signal preservation. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization.
arXiv Detail & Related papers (2025-02-04T05:03:42Z) - Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation. This is the first work to provide an empirical explanation, based on token perplexity reduction, of how to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Understanding Layer Significance in LLM Alignment [23.582520695083588]
We propose identifying which layers within large language models are most critical to the alignment process. Experimental results reveal that, despite substantial differences in alignment datasets, the important layers of a model exhibit nearly 90% overlap. The results also indicate that freezing non-essential layers improves overall model performance, while selectively tuning the most critical layers significantly enhances fine-tuning efficiency with minimal performance loss.
arXiv Detail & Related papers (2024-10-23T13:47:05Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Understanding and Diagnosing Vulnerability under Adversarial Attacks [62.661498155101654]
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks.
We propose a novel interpretability method, InterpretGAN, to generate explanations for features used for classification in latent variables.
We also design the first diagnostic method to quantify the vulnerability contributed by each layer.
arXiv Detail & Related papers (2020-07-17T01:56:28Z)