LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking
- URL: http://arxiv.org/abs/2403.04303v1
- Date: Thu, 7 Mar 2024 08:10:59 GMT
- Title: LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking
- Authors: Jialin Li, Qiang Nie, Weifu Fu, Yuhuan Lin, Guangpin Tao, Yong Liu,
Chengjie Wang
- Abstract summary: LORS allows stacked modules to share the majority of parameters, requiring a much smaller number of unique ones per module to match or even surpass the performance of using entirely distinct ones.
We validate our method by applying it to the stacked decoders of a query-based object detector, and conduct extensive experiments on the widely used MS COCO dataset.
- Score: 37.24438285812178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models, particularly those based on transformers, often employ
numerous stacked structures, which possess identical architectures and perform
similar functions. While effective, this stacking paradigm leads to a
substantial increase in the number of parameters, posing challenges for
practical applications. In today's landscape of increasingly large models,
stacking depth can even reach dozens, further exacerbating this issue. To
mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS
allows stacked modules to share the majority of parameters, requiring a much
smaller number of unique ones per module to match or even surpass the
performance of using entirely distinct ones, thereby significantly reducing
parameter usage. We validate our method by applying it to the stacked decoders
of a query-based object detector, and conduct extensive experiments on the
widely used MS COCO dataset. Experimental results demonstrate the effectiveness
of our method, as even with a 70% reduction in the parameters of the decoder,
our method still enables the model to achieve comparable or even better performance.
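The abstract describes the mechanism only at a high level. As a rough illustration of how a stack can share one base weight while keeping only a small low-rank residual private to each module, here is a minimal PyTorch sketch; it is our own construction rather than the authors' code, and names such as SharedLowRankLinear, StackedFFN, and the chosen rank are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLowRankLinear(nn.Module):
    """Linear layer whose effective weight is a shared base matrix plus a
    per-module low-rank residual: W_i = W_shared + A_i @ B_i, with rank r << d."""
    def __init__(self, shared_weight: nn.Parameter, rank: int = 8):
        super().__init__()
        out_dim, in_dim = shared_weight.shape
        self.shared_weight = shared_weight                    # shared across the stack
        self.A = nn.Parameter(torch.zeros(out_dim, rank))     # private low-rank factor
        self.B = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))        # private bias

    def forward(self, x):
        weight = self.shared_weight + self.A @ self.B         # per-module effective weight
        return F.linear(x, weight, self.bias)

class StackedFFN(nn.Module):
    """A stack of feed-forward blocks in which every block reuses the same two
    base projection matrices and owns only its low-rank residuals and biases."""
    def __init__(self, d_model=256, d_ff=1024, num_layers=6, rank=8):
        super().__init__()
        w_up = nn.Parameter(torch.randn(d_ff, d_model) * d_model ** -0.5)
        w_down = nn.Parameter(torch.randn(d_model, d_ff) * d_ff ** -0.5)
        self.layers = nn.ModuleList(
            nn.ModuleDict({
                "up": SharedLowRankLinear(w_up, rank),
                "down": SharedLowRankLinear(w_down, rank),
            })
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer["down"](F.relu(layer["up"](x)))      # residual FFN block
        return x

model = StackedFFN()
# parameters() de-duplicates the shared base weights, so they are counted once.
print(sum(p.numel() for p in model.parameters()))
```

With these illustrative sizes, each block owns roughly 22K private parameters instead of the roughly 526K it would need with fully distinct weights, while the two shared base matrices are stored once for the whole stack. The paper itself applies the idea to the stacked decoders of a query-based object detector; the sketch only illustrates the sharing scheme on a generic feed-forward stack.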
Related papers
- Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate.
We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Do deep neural networks utilize the weight space efficiently? [2.9914612342004503]
Deep learning models like Transformers and Convolutional Neural Networks (CNNs) have revolutionized various domains, but their parameter-intensive nature hampers deployment in resource-constrained settings.
We introduce a novel concept utilizing column space and row space of weight matrices, which allows for a substantial reduction in model parameters without compromising performance.
Our approach applies to both Bottleneck and Attention layers, effectively halving the parameters while incurring only minor performance degradation.
arXiv Detail & Related papers (2024-01-26T21:51:49Z)
- MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency [10.641875933652647]
We introduce multi-granularity architecture search (MGAS) to discover both effective and efficient neural networks.
We learn discretization functions specific to each granularity level to adaptively determine the unit remaining ratio according to the evolving architecture.
Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods in achieving a better trade-off between model performance and model size.
arXiv Detail & Related papers (2023-10-23T16:32:18Z)
- Can the Query-based Object Detector Be Designed with Fewer Stages? [15.726619371300558]
We propose a novel model called GOLO, which follows a two-stage decoding paradigm.
Compared to other mainstream query-based models with multi-stage decoders, our model employs fewer decoder stages while still achieving considerable performance.
arXiv Detail & Related papers (2023-09-28T09:58:52Z)
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize mode approximation to generate 0.1M trainable parameters to implement multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
- Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration [71.95914457415624]
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency.
We propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem.
Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines.
arXiv Detail & Related papers (2022-11-29T17:10:24Z)
- MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers [12.432191400869002]
MiniALBERT is a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student; a minimal sketch of this kind of recursive, weight-shared transformer appears after this list.
We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models.
arXiv Detail & Related papers (2022-10-12T17:23:21Z)
- Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection [91.43066633305662]
The central question in RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information.
In this paper, we explore these issues from a new perspective.
We implement a more flexible and efficient form of multi-scale cross-modal feature processing.
arXiv Detail & Related papers (2020-07-13T07:59:55Z)
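As a point of contrast with LORS (shared base weights plus small private residuals), the MiniALBERT entry above relies on recursive transformers, where a single layer's weights are reused at every depth. Below is a minimal sketch of that cross-layer sharing idea using a plain PyTorch encoder layer; the class RecursiveEncoder and its default sizes are our own illustrative choices, not the paper's API.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """Applies one transformer encoder layer repeatedly: the stack is deep in
    computation but stores only a single layer's parameters."""
    def __init__(self, d_model=384, n_heads=6, d_ff=1536, num_iterations=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff, batch_first=True)
        self.num_iterations = num_iterations

    def forward(self, x):
        for _ in range(self.num_iterations):   # same weights reused at every depth
            x = self.layer(x)
        return x

model = RecursiveEncoder()
x = torch.randn(2, 16, 384)                    # (batch, sequence, d_model)
print(model(x).shape, sum(p.numel() for p in model.parameters()))
```

Recursive students of this kind typically add small extra modules (e.g. bottleneck adapters) so that successive passes are not strictly identical, but the core parameter saving comes from the reuse shown here.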
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.