Strategic Fusion Optimizes Transformer Compression
- URL: http://arxiv.org/abs/2501.03273v1
- Date: Sun, 05 Jan 2025 04:46:14 GMT
- Title: Strategic Fusion Optimizes Transformer Compression
- Authors: Md Shoaibur Rahman
- Abstract summary: This study investigates transformer model compression by systematically pruning its layers. We evaluated 14 pruning strategies across nine diverse datasets, including 12 strategies based on different signals obtained from layer activations, mutual information, gradients, weights, and attention.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates transformer model compression by systematically pruning its layers. We evaluated 14 pruning strategies across nine diverse datasets, including 12 strategies based on different signals obtained from layer activations, mutual information, gradients, weights, and attention. To address the limitations of single-signal strategies, we introduced two fusion strategies, linear regression and random forest, which combine individual strategies (i.e., strategic fusion), for more informed pruning decisions. Additionally, we applied knowledge distillation to mitigate any accuracy loss during layer pruning. Our results reveal that random forest strategic fusion outperforms individual strategies in seven out of nine datasets and achieves near-optimal performance in the other two. The distilled random forest surpasses the original accuracy in six datasets and mitigates accuracy drops in the remaining three. Knowledge distillation also improves the accuracy-to-size ratio by an average factor of 18.84 across all datasets. Supported by mathematical foundations and biological analogies, our findings suggest that strategically combining multiple signals can lead to efficient, high-performing transformer models for resource-constrained applications.
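To make the fusion idea concrete, here is a minimal, hypothetical sketch of the random-forest flavor of strategic fusion: per-layer signals are fused into a predicted accuracy drop, and the layers predicted to matter least are pruned. The signal values, the supervision target, and all parameter choices below are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of random-forest strategic fusion for layer pruning.
# Signal values and targets are placeholders, not the paper's data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

n_layers = 12
rng = np.random.default_rng(0)

# Per-layer signals, one row per layer: e.g. activation norm, mutual
# information, gradient norm, weight norm, attention entropy (placeholders).
signals = rng.random((n_layers, 5))

# Supervision target: accuracy drop observed when each layer is pruned
# alone (placeholder values standing in for measured drops).
accuracy_drop = rng.random(n_layers)

# Fuse the individual signals into one pruning score with a random forest.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(signals, accuracy_drop)
predicted_drop = forest.predict(signals)

# Prune the layers whose removal is predicted to hurt accuracy the least.
k = 4
prune_order = np.argsort(predicted_drop)[:k]
print("layers to prune:", sorted(prune_order.tolist()))
```

In the paper's setting, twelve single-signal strategies supply the inputs, and the two fused strategies (linear regression and random forest) make the final pruning decision; knowledge distillation then recovers accuracy lost to pruning.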
Related papers
- Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data [1.3607388598209322]
Local Climate Zones (LCZs) provide a zoning map for studying urban structures and land use.
Data fusion is significant for improving accuracy, owing to the complexity of the data.
This study analyzes different fusion strategies in multi-class LCZ classification models.
arXiv Detail & Related papers (2026-03-04T19:47:13Z) - TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically relies on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks.
We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology.
TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z) - How does the optimizer implicitly bias the model merging loss landscape? [66.96572894292895]
We show that a single quantity -- the effective noise scale -- unifies the impact of inference and data choices on model merging.
Across datasets, merging success is a non-monotonic function of effective noise, with a distinct optimum.
arXiv Detail & Related papers (2025-10-06T10:56:41Z) - Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs [49.995906301946]
Existing methods usually leverage a fixed strategy to guide Large Language Models (LLMs) in mathematical reasoning.
Our analysis reveals that a single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency.
We propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution.
arXiv Detail & Related papers (2025-09-29T07:22:41Z) - Layer Pruning with Consensus: A Triple-Win Solution [0.0]
Layer-pruning approaches often rely on a single criterion that may not fully capture the complex, underlying properties of layers.
We propose a novel approach that combines multiple similarity metrics into a single, expressive measure for identifying low-importance layers, called the Consensus criterion.
Our technique delivers a triple-win solution: a low accuracy drop, a high performance improvement, and increased robustness to adversarial attacks.
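As a rough illustration of how several layer-similarity metrics might be collapsed into one consensus score, consider the following sketch; the metric names and values are placeholders, not the paper's formulation.

```python
# Hedged sketch of a consensus-style criterion: combine several normalized
# layer-similarity scores into one low-importance measure. The metrics and
# values are illustrative assumptions.
import numpy as np

def normalize(x):
    # Min-max normalize so each metric contributes on the same scale.
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

# Per-layer similarity between each layer's input and output under three
# hypothetical metrics (higher similarity suggests lower importance).
cka = [0.91, 0.40, 0.85, 0.30]
cosine = [0.88, 0.55, 0.80, 0.35]
l2_inv = [0.90, 0.45, 0.70, 0.25]

consensus = np.mean([normalize(m) for m in (cka, cosine, l2_inv)], axis=0)
least_important = int(np.argmax(consensus))
print("consensus scores:", np.round(consensus, 3), "-> prune layer", least_important)
```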
arXiv Detail & Related papers (2024-11-21T17:41:27Z) - Efficient learning of differential network in multi-source non-paranormal graphical models [2.5905193932831585]
This paper addresses learning sparse structural changes, or differential networks, between two classes of non-paranormal graphical models.
Our strategy of combining datasets from multiple sources is shown to be very effective for inferring differential networks in real-world problems.
arXiv Detail & Related papers (2024-10-03T13:59:38Z) - Exploring Selective Layer Fine-Tuning in Federated Learning [48.470385357429215]
Federated learning (FL) has emerged as a promising paradigm for fine-tuning foundation models using distributed data.
We study selective layer fine-tuning in FL, emphasizing a flexible approach that allows the clients to adjust their selected layers according to their local data and resources.
arXiv Detail & Related papers (2024-08-28T07:48:39Z) - LayerMatch: Do Pseudo-labels Benefit All Layers? [77.59625180366115]
Semi-supervised learning offers a promising solution to mitigate the dependency on labeled data.
We develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering.
Our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks.
arXiv Detail & Related papers (2024-06-20T11:25:50Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel re-ranking technique with a lower upper bound on time complexity, reducing the memory complexity from O(n^2) to O(kn) with k << n.
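The memory claim is about what gets stored: keeping each sample's k nearest neighbors needs O(kn) entries, versus O(n^2) for the full pairwise matrix. A generic illustration of this storage pattern (not the paper's re-ranking algorithm) follows.

```python
# Generic illustration of the O(kn) storage idea behind k-NN re-ranking:
# keep only each sample's k nearest neighbors instead of the full n x n
# pairwise distance matrix.
import numpy as np

n, d, k = 1000, 64, 10
rng = np.random.default_rng(0)
feats = rng.standard_normal((n, d))

neighbors = np.empty((n, k), dtype=np.int64)  # O(kn) storage
for i in range(n):
    dist = np.linalg.norm(feats - feats[i], axis=1)  # one row at a time
    dist[i] = np.inf  # exclude self from the neighbor set
    neighbors[i] = np.argpartition(dist, k)[:k]

print(neighbors.shape)  # (1000, 10) instead of a 1000 x 1000 matrix
```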
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Enhancing Privacy against Inversion Attacks in Federated Learning by using Mixing Gradients Strategies [0.31498833540989407]
Federated learning reduces the risk of information leakage, but remains vulnerable to attacks.
We show how several neural network design decisions can defend against gradient inversion attacks.
These strategies are also shown to be useful for deep convolutional neural networks such as LeNet for image recognition.
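One generic way to read "mixing gradients" is to average gradients over several local batches before sharing, so that no single batch's gradient is exposed. The PyTorch sketch below illustrates that reading; the model, batch sizes, and mixing rule are assumptions, not the paper's exact strategy.

```python
# Illustrative sketch of a gradient-mixing idea: average gradients over
# several local batches before sharing (a generic reading, not the paper's
# exact defense).
import torch
import torch.nn as nn

model = nn.Linear(32, 2)
loss_fn = nn.CrossEntropyLoss()

mixed = [torch.zeros_like(p) for p in model.parameters()]
num_batches = 4
for _ in range(num_batches):
    x = torch.randn(16, 32)
    y = torch.randint(0, 2, (16,))
    model.zero_grad()
    loss_fn(model(x), y).backward()
    for m, p in zip(mixed, model.parameters()):
        m += p.grad / num_batches  # accumulate the batch-averaged mixture

# `mixed` is what a client would share instead of a single batch's gradient.
print([m.shape for m in mixed])
```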
arXiv Detail & Related papers (2022-04-26T12:08:28Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection [91.43066633305662]
The central challenge in RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information.
In this paper, we explore these issues from a new perspective.
We implement a kind of more flexible and efficient multi-scale cross-modal feature processing.
arXiv Detail & Related papers (2020-07-13T07:59:55Z) - Joint Multi-Dimension Pruning via Numerical Gradient Update [120.59697866489668]
We present joint multi-dimension pruning (abbreviated as JointPruning), an effective method of pruning a network along three crucial dimensions simultaneously: spatial, depth, and channel.
We show that our method is optimized collaboratively across the three dimensions in a single end-to-end training pass and is more efficient than previous exhaustive methods.
arXiv Detail & Related papers (2020-05-18T17:57:09Z) - Classification of Hyperspectral and LiDAR Data Using Coupled CNNs [39.55503477017984]
We propose an efficient framework to fuse hyperspectral and Light Detection And Ranging (LiDAR) data using two coupled convolutional neural networks (CNNs).
One CNN is designed to learn spectral-spatial features from hyperspectral data, the other is used to capture the elevation information from LiDAR data.
In the fusion phase, feature-level and decision-level fusion methods are simultaneously used to integrate these heterogeneous features.
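A schematic of simultaneous feature-level and decision-level fusion for two modality-specific streams might look as follows; the encoder shapes, heads, and equal-weight averaging are assumptions for illustration, not the paper's architecture.

```python
# Sketch of combining feature-level and decision-level fusion for two
# modality-specific encoders (shapes and weighting are assumptions).
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.hsi_net = nn.Sequential(nn.Flatten(), nn.Linear(30 * 8 * 8, dim), nn.ReLU())
        self.lidar_net = nn.Sequential(nn.Flatten(), nn.Linear(1 * 8 * 8, dim), nn.ReLU())
        self.fused_head = nn.Linear(2 * dim, num_classes)   # feature-level fusion
        self.hsi_head = nn.Linear(dim, num_classes)         # per-stream decisions
        self.lidar_head = nn.Linear(dim, num_classes)

    def forward(self, hsi, lidar):
        f1, f2 = self.hsi_net(hsi), self.lidar_net(lidar)
        fused = self.fused_head(torch.cat([f1, f2], dim=1))
        # Decision-level fusion: average the fused prediction with each
        # stream's own prediction (equal weights are an assumption).
        return (fused + self.hsi_head(f1) + self.lidar_head(f2)) / 3

logits = TwoStreamFusion()(torch.randn(4, 30, 8, 8), torch.randn(4, 1, 8, 8))
print(logits.shape)  # torch.Size([4, 10])
```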
arXiv Detail & Related papers (2020-02-04T06:23:36Z) - An empirical evaluation of imbalanced data strategies from a practitioner's point of view [1.9580473532948401]
This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach.
These strategies were tested on 58 real-life binary imbalanced datasets with imbalance rates ranging from 3 to 120.
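Two of the listed strategies, class-weight adjustment and oversampling, are easy to sketch with scikit-learn; the synthetic dataset and imbalance rate below are made up for illustration.

```python
# Sketch of two of the listed mitigations: class-weight adjustment and
# random oversampling. Dataset and imbalance rate are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.standard_normal((1030, 5))
y = np.r_[np.zeros(1000, int), np.ones(30, int)]  # roughly 33:1 imbalance

# Strategy 1: reweight classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Strategy 2: oversample the minority class up to the majority size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=1000, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.r_[y[y == 0], y_up]
clf2 = LogisticRegression().fit(X_bal, y_bal)

print(clf.score(X, y), clf2.score(X, y))
```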
arXiv Detail & Related papers (2018-10-16T17:50:31Z)