Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature
- URL: http://arxiv.org/abs/2504.04966v1
- Date: Mon, 07 Apr 2025 11:53:16 GMT
- Title: Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature
- Authors: Shion Fukuhata, Yoshinobu Kano
- Abstract summary: When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and input it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds.
- Score: 1.1970409518725493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and input it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the CLS vector in the final layer contain equivalent information, that most tasks require only 2-3 dimensions, and that while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, where fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, that BERT has considerable redundancy enabling it to handle multiple tasks simultaneously, and that its number of dimensions may be excessive.
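The setup described in the abstract, selecting only a few dimensions of the final-layer output (here, the [CLS] vector) and feeding them into a newly created fully connected layer, can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' code: the chosen dimension indices, the bert-base-uncased checkpoint, and the linear classifier head are all placeholders.

```python
# Minimal sketch (not the paper's released code): classify a GLUE-style input
# using only a few selected dimensions of BERT's final-layer [CLS] vector.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class FewDimClassifier(nn.Module):
    def __init__(self, dim_indices, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # Indices of the hidden dimensions to keep (e.g. 2-3 of the 768);
        # the specific indices here are illustrative assumptions.
        self.register_buffer("dim_indices", torch.tensor(dim_indices))
        # Newly created fully connected layer on top of the selected dimensions.
        self.classifier = nn.Linear(len(dim_indices), num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0, :]    # [CLS] token vector
        selected = cls_vec[:, self.dim_indices]     # keep only a few dimensions
        return self.classifier(selected)

# Usage: encode a sentence pair and run a forward pass.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = FewDimClassifier(dim_indices=[0, 1, 2], num_labels=2)
batch = tokenizer("a premise", "a hypothesis", return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```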
Related papers
- Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z)
- Not all layers are equally as important: Every Layer Counts BERT [5.121744234312891]
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models.
Our approach allows each transformer layer to select which outputs of previous layers to process.
arXiv Detail & Related papers (2023-11-03T23:08:50Z)
- Improving Reliability of Fine-tuning with Block-wise Optimisation [6.83082949264991]
Fine-tuning can be used to tackle domain-specific tasks by transferring knowledge.
We propose a novel block-wise optimization mechanism, which adapts the weights of a group of layers of a pre-trained model.
The proposed approaches are tested on an often-used dataset, Tf_flower.
arXiv Detail & Related papers (2023-01-15T16:20:18Z)
- TrimBERT: Tailoring BERT for Trade-offs [6.068076825261616]
We show that reducing the number of intermediate layers in BERT-Base results in minimal loss of fine-tuning accuracy on downstream tasks.
We further mitigate two key bottlenecks by replacing all softmax operations in the self-attention layers with a computationally simpler alternative.
arXiv Detail & Related papers (2022-02-24T23:06:29Z)
- Layerwise Optimization by Gradient Decomposition for Continual Learning [78.58714373218118]
Deep neural networks achieve state-of-the-art and sometimes super-human performance across various domains.
When learning tasks sequentially, the networks easily forget the knowledge of previous tasks, a phenomenon known as "catastrophic forgetting".
arXiv Detail & Related papers (2021-05-17T01:15:57Z)
- Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [72.38919601150175]
We propose Bilayer Convolutional Network (BCNet) to segment highly-overlapping objects.
BCNet's top GCN layer detects the occluding objects (occluder), while its bottom GCN layer infers the partially occluded instance (occludee).
arXiv Detail & Related papers (2021-03-23T06:25:42Z)
- Undivided Attention: Are Intermediate Layers Necessary for BERT? [2.8935588665357077]
We investigate the importance of intermediate layers on the overall network performance of downstream tasks.
We show that reducing the number of intermediate layers and modifying the architecture for BERT-BASE results in minimal loss in fine-tuning accuracy for downstream tasks.
arXiv Detail & Related papers (2020-12-22T08:46:14Z)
- Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior [80.5637175255349]
We propose a new enriched prior based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net.
To extract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical structure-constrained neural network.
Our network can obtain state-of-the-art performance for representation learning and clustering.
arXiv Detail & Related papers (2020-09-08T13:10:21Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
- BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Encoder Representations from Transformers (BERT) has achieved tremendous success in many natural language processing (NLP) tasks.
We find that, surprisingly, the output layer of BERT can reconstruct the input sentence by directly taking each layer of BERT as input.
We propose a quite simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.