Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders
- URL: http://arxiv.org/abs/2502.16722v1
- Date: Sun, 23 Feb 2025 21:29:50 GMT
- Title: Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders
- Authors: Suneel Nadipalli,
- Abstract summary: Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks.<n>This paper explores the underlying mechanisms of fine-tuning, specifically in the BERT transformer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks. From early applications in models like BERT to fine-tuning Large Language Models (LLMs), this approach has been instrumental in adapting general-purpose architectures for specialized downstream tasks. Understanding the fine-tuning process is crucial for uncovering how transformers adapt to specific objectives, retain general representations, and acquire task-specific features. This paper explores the underlying mechanisms of fine-tuning, specifically in the BERT transformer, by analyzing activation similarity, training Sparse AutoEncoders (SAEs), and visualizing token-level activations across different layers. Based on experiments conducted across multiple datasets and BERT layers, we observe a steady progression in how features adapt to the task at hand: early layers primarily retain general representations, middle layers act as a transition between general and task-specific features, and later layers fully specialize in task adaptation. These findings provide key insights into the inner workings of fine-tuning and its impact on representation learning within transformer architectures.
Related papers
- Small transformer architectures for task switching [2.7195102129095003]
It is non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches.<n>We show that standard transformers cannot solve a basic task switching reference model.<n>We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies.
arXiv Detail & Related papers (2025-08-06T14:01:05Z) - Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks.<n>We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability.<n>Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z) - Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement [13.38679135071682]
We propose a Function-preserving Attention Replacement framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules.<n>We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency.
arXiv Detail & Related papers (2025-05-24T02:23:46Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
arXiv Detail & Related papers (2024-06-17T07:24:38Z) - HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain
Generalization [69.33162366130887]
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features.
We introduce a novel method designed to supplement the model with domain-level and task-specific characteristics.
This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization.
arXiv Detail & Related papers (2024-01-18T04:23:21Z) - FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised
Pretraining [36.44039681893334]
Hyperspectral images (HSIs) contain rich spectral and spatial information.
Current state-of-the-art hyperspectral transformers only tokenize the input HSI sample along the spectral dimension.
We propose a novel factorized spectral-spatial transformer that incorporates factorized self-supervised pretraining procedures.
arXiv Detail & Related papers (2023-09-18T02:05:52Z) - Semantic-visual Guided Transformer for Few-shot Class-incremental
Learning [6.300141694311465]
We develop a semantic-visual guided Transformer (SV-T) to enhance the feature extracting capacity of the pre-trained feature backbone on incremental classes.
Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone.
arXiv Detail & Related papers (2023-03-27T15:06:49Z) - Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA)
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z) - Systematic Generalization and Emergent Structures in Transformers
Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z) - Unveiling Transformers with LEGO: a synthetic reasoning task [23.535488809197787]
We study how the transformer architecture learns to follow a chain of reasoning.
In some data regime the trained transformer finds "shortcut" solutions to follow the chain of reasoning.
We find that one can prevent such shortcut with appropriate architecture modification or careful data preparation.
arXiv Detail & Related papers (2022-06-09T06:30:17Z) - AutoTrans: Automating Transformer Design via Reinforced Architecture
Search [52.48985245743108]
This paper empirically explore how to set layer-norm, whether to scale, number of layers, number of heads, activation function, etc, so that one can obtain a transformer architecture that better suits the tasks at hand.
Experiments on the CoNLL03, Multi-30k, IWSLT14 and WMT-14 shows that the searched transformer model can outperform the standard transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z) - MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and
Architectures [61.73533544385352]
We propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data.
As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize heterogeneous tasks and architectures.
arXiv Detail & Related papers (2020-06-13T02:54:59Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.