Efficient Compression of Overparameterized Deep Models through
Low-Dimensional Learning Dynamics
- URL: http://arxiv.org/abs/2311.05061v2
- Date: Mon, 11 Mar 2024 18:55:33 GMT
- Title: Efficient Compression of Overparameterized Deep Models through
Low-Dimensional Learning Dynamics
- Authors: Soo Min Kwon, Zekai Zhang, Dogyoon Song, Laura Balzano, Qing Qu
- Abstract summary: We present a novel approach for compressing overparameterized models.
Our algorithm improves the training efficiency by more than 2x, without compromising generalization.
- Score: 10.673414267895355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparameterized models have proven to be powerful tools for solving various
machine learning tasks. However, overparameterization often leads to a
substantial increase in computational and memory costs, which in turn requires
extensive resources to train. In this work, we present a novel approach for
compressing overparameterized models, developed through studying their learning
dynamics. We observe that for many deep models, updates to the weight matrices
occur within a low-dimensional invariant subspace. For deep linear models, we
demonstrate that their principal components are fitted incrementally within a
small subspace, and use these insights to propose a compression algorithm for
deep linear networks that involves decreasing the width of their intermediate
layers. We empirically evaluate the effectiveness of our compression technique
on matrix recovery problems. Remarkably, by using an initialization that
exploits the structure of the problem, we observe that our compressed network
converges faster than the original network, consistently yielding smaller
recovery errors. We substantiate this observation by developing a theory
focused on deep matrix factorization. Finally, we empirically demonstrate how
our compressed model has the potential to improve the utility of deep nonlinear
models. Overall, our algorithm improves the training efficiency by more than
2x, without compromising generalization.
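The following is a minimal sketch of the compression idea for low-rank matrix completion, not the authors' code: the intermediate layers of a deep linear factorization are given a narrow width r_hat, and the outer factors are seeded with an SVD-based initialization of the observed matrix. The depth, width, learning rate, and exact initialization are illustrative assumptions.

    import torch

    torch.manual_seed(0)
    n, rank, depth, r_hat = 100, 5, 3, 8      # r_hat: compressed hidden width (> true rank)

    # Synthetic low-rank matrix completion problem
    M = torch.randn(n, rank) @ torch.randn(rank, n)
    mask = (torch.rand(n, n) < 0.3).float()

    # Compressed deep linear network: the n-by-n intermediate factors of a fully
    # overparameterized model are replaced by narrow r_hat-by-r_hat factors, and
    # the outer factors are seeded from the SVD of the observed entries
    # (a heuristic stand-in for the structured initialization described above).
    U, S, Vh = torch.linalg.svd(mask * M)
    factors = [torch.nn.Parameter(U[:, :r_hat] * S[:r_hat].sqrt())]            # n x r_hat
    factors += [torch.nn.Parameter(torch.eye(r_hat)) for _ in range(depth - 2)]
    factors += [torch.nn.Parameter(S[:r_hat, None].sqrt() * Vh[:r_hat, :])]    # r_hat x n

    opt = torch.optim.Adam(factors, lr=1e-2)
    for step in range(2000):
        X = factors[0]
        for W in factors[1:]:
            X = X @ W                                    # end-to-end product of all factors
        loss = ((mask * (X - M)) ** 2).sum() / mask.sum()
        opt.zero_grad(); loss.backward(); opt.step()

    err = torch.linalg.norm(X.detach() - M) / torch.linalg.norm(M)
    print(f"relative recovery error: {err.item():.3e}")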
Related papers
- Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation [10.48108719012248]
We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model.
In contrast to much of the previous work, we scale up the parameters of the student model during training.
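For context, a minimal sketch of the standard knowledge-distillation objective such student-teacher setups build on; the tensor-decomposition-based over-parameterization of the student described above is not shown, and the temperature and mixing weight are arbitrary.

    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Temperature-scaled KL divergence to the teacher plus cross-entropy on labels
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Toy usage with random logits
    s = torch.randn(8, 10, requires_grad=True)   # student logits
    t = torch.randn(8, 10)                       # teacher logits (frozen)
    y = torch.randint(0, 10, (8,))
    kd_loss(s, t, y).backward()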
arXiv Detail & Related papers (2024-11-10T12:40:59Z)
- Generalized Nested Latent Variable Models for Lossy Coding applied to Wind Turbine Scenarios [14.48369551534582]
A learning-based approach seeks to optimize the trade-off between compression rate and reconstructed image quality.
A successful technique consists in introducing a deep hyperprior that operates within a 2-level nested latent variable model.
This paper extends this concept by designing a generalized L-level nested generative model with a Markov chain structure.
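A schematic sketch of the nested Markov-chain structure only (quantization, priors, and entropy coding are omitted, and the layer sizes are arbitrary): each latent level is encoded from the one below and decoded from the one above.

    import torch
    import torch.nn as nn

    class NestedLatents(nn.Module):
        # L-level nested latents with Markov chain structure: z_1 = enc_1(x),
        # z_l = enc_l(z_{l-1}); decoding conditions each level only on the one above.
        def __init__(self, dims=(64, 32, 16, 8)):       # dims[0] = data dim, rest = latent dims
            super().__init__()
            self.encoders = nn.ModuleList(
                nn.Linear(dims[l], dims[l + 1]) for l in range(len(dims) - 1))
            self.decoders = nn.ModuleList(
                nn.Linear(dims[l + 1], dims[l]) for l in range(len(dims) - 1))

        def forward(self, x):
            zs, h = [], x
            for enc in self.encoders:                    # bottom-up pass
                h = torch.tanh(enc(h))
                zs.append(h)
            h = zs[-1]
            for dec in list(self.decoders)[1:][::-1]:    # top-down pass, level by level
                h = torch.tanh(dec(h))
            return self.decoders[0](h), zs               # reconstruction and all latents

    x_hat, zs = NestedLatents()(torch.randn(4, 64))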
arXiv Detail & Related papers (2024-06-10T11:00:26Z)
- Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation [12.07880147193174]
We show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens.
We demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models.
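A standard LoRA-style low-rank adapter, shown only to make the overparameterized low-rank adaptation setting concrete; it is not this paper's compression algorithm, and the rank and scaling are arbitrary.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # The frozen base weight is updated only through a rank-r product B @ A,
        # so the trainable update is confined to a low-dimensional subspace.
        def __init__(self, d_in, d_out, r=4, alpha=1.0):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5,
                                       requires_grad=False)        # frozen base weight
            self.A = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5)
            self.B = nn.Parameter(torch.zeros(d_out, r))            # zero init: no change at start
            self.alpha = alpha

        def forward(self, x):
            return x @ (self.weight + self.alpha * self.B @ self.A).T

    y = LoRALinear(128, 64, r=4)(torch.randn(2, 128))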
arXiv Detail & Related papers (2024-06-06T14:29:49Z)
- Learning Nonlinear Projections for Reduced-Order Modeling of Dynamical Systems using Constrained Autoencoders [0.0]
We introduce a class of nonlinear projections described by constrained autoencoder neural networks in which both the manifold and the projection fibers are learned from data.
Our architecture uses invertible activation functions and biorthogonal weight matrices to ensure that the encoder is a left inverse of the decoder.
We also introduce new dynamics-aware cost functions that promote learning of oblique projection fibers that account for fast dynamics and nonnormality.
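The left-inverse property can be illustrated with a linear caricature of such an architecture (not the paper's network): with biorthogonal weight matrices (Psi^T Phi = I) and an invertible activation, the encoder exactly inverts the decoder, and Phi Psi^T is the induced oblique projection.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 10, 3                                 # ambient and latent dimensions

    # Biorthogonal weight pair: Psi.T @ Phi = I_r
    Phi = rng.standard_normal((n, r))
    Psi = rng.standard_normal((n, r))
    Psi = Psi @ np.linalg.inv(Phi.T @ Psi)       # enforce biorthogonality

    sigma, sigma_inv = np.tanh, np.arctanh       # invertible activation and its inverse

    def decode(z):                               # latent -> ambient space
        return Phi @ sigma(z)

    def encode(x):                               # left inverse of decode when Psi.T @ Phi = I
        return sigma_inv(Psi.T @ x)

    z = 0.5 * rng.standard_normal(r)
    print(np.allclose(encode(decode(z)), z))     # True: encoder o decoder = identity

    P = Phi @ Psi.T                              # oblique projection onto range(Phi) along null(Psi.T)
    print(np.allclose(P @ P, P))                 # True: idempotent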
arXiv Detail & Related papers (2023-07-28T04:01:48Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for distributed training of large models.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
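As commonly described, IST partitions a network's hidden neurons across workers so that each worker trains a thin subnetwork independently before the full model is reassembled. The sketch below illustrates that partitioning for a one-hidden-layer ReLU network; it is a paraphrase of the general idea, not code or analysis from this paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out, n_workers = 20, 12, 5, 3

    W1 = 0.1 * rng.standard_normal((d_hidden, d_in))
    W2 = 0.1 * rng.standard_normal((d_out, d_hidden))

    def local_update(W1_part, W2_part, X, Y, lr=1e-2, steps=10):
        # Each worker trains only its own slice of hidden units (plain SGD, squared loss).
        for _ in range(steps):
            H = np.maximum(W1_part @ X.T, 0.0)          # hidden activations of the subnetwork
            G = (W2_part @ H - Y.T) / len(X)            # output-layer error
            gW2 = G @ H.T
            gW1 = ((W2_part.T @ G) * (H > 0)) @ X
            W1_part -= lr * gW1
            W2_part -= lr * gW2
        return W1_part, W2_part

    X, Y = rng.standard_normal((64, d_in)), rng.standard_normal((64, d_out))

    # One IST-style round: partition hidden units, train subnetworks independently, reassemble.
    for idx in np.array_split(rng.permutation(d_hidden), n_workers):
        W1[idx], W2[:, idx] = local_update(W1[idx].copy(), W2[:, idx].copy(), X, Y)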
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
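A toy illustration of the sharing pattern (not the actual MPO factorization): each layer's weight is reconstructed from small factors A_l @ C @ B_l, and the central factor C is a single shared parameter used by every layer.

    import torch
    import torch.nn as nn

    class SharedCoreLinear(nn.Module):
        # Layer-specific "auxiliary" factors A, B and a central factor shared by all layers.
        def __init__(self, d, r, central):
            super().__init__()
            self.A = nn.Parameter(torch.randn(d, r) / r ** 0.5)
            self.B = nn.Parameter(torch.randn(r, d) / d ** 0.5)
            self.central = central                      # shared r-by-r parameter

        def forward(self, x):
            W = self.A @ self.central @ self.B          # reconstruct the full d-by-d weight
            return x @ W.T

    d, r, n_layers = 256, 32, 6
    central = nn.Parameter(torch.eye(r))                # one central tensor for the whole stack
    layers = nn.ModuleList(SharedCoreLinear(d, r, central) for _ in range(n_layers))

    x = torch.randn(4, d)
    for layer in layers:
        x = torch.relu(layer(x))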
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- FOSTER: Feature Boosting and Compression for Class-Incremental Learning [52.603520403933985]
Deep neural networks suffer from catastrophic forgetting when learning new categories.
We propose a novel two-stage learning paradigm FOSTER, empowering the model to learn new categories adaptively.
arXiv Detail & Related papers (2022-04-10T11:38:33Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit performance similar to conventionally trained models, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
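An illustrative reimplementation of the reparameterisation (alpha, layer sizes, and the pruning threshold are arbitrary choices): each effective weight is the sign-preserving power theta * |theta|^(alpha - 1), so parameters near zero receive smaller gradients and the trained weights concentrate at zero.

    import torch
    import torch.nn as nn

    class PowerpropLinear(nn.Module):
        # Effective weight w = theta * |theta|**(alpha - 1): a sign-preserving power
        # that biases training toward weights with high density at zero.
        def __init__(self, d_in, d_out, alpha=2.0):
            super().__init__()
            self.theta = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
            self.bias = nn.Parameter(torch.zeros(d_out))
            self.alpha = alpha

        def effective_weight(self):
            return self.theta * self.theta.abs().pow(self.alpha - 1)

        def forward(self, x):
            return x @ self.effective_weight().T + self.bias

    layer = PowerpropLinear(128, 64, alpha=2.0)
    y = layer(torch.randn(4, 128))

    # After training, magnitude pruning acts on the effective weights:
    w = layer.effective_weight().detach().abs()
    keep = w >= w.flatten().kthvalue(int(0.9 * w.numel())).values   # keep the top ~10%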
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
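A minimal sketch of the coupled-flow idea (the generator and integrator are illustrative, not the paper's construction): the hidden state follows one flow while its weight matrix evolves on the orthogonal group O(d) via a Cayley step generated by a skew-symmetric matrix, which preserves orthogonality exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    d, h, steps = 8, 0.05, 100

    B = rng.standard_normal((d, d))
    A = B - B.T                                   # skew-symmetric generator of a flow on O(d)

    I = np.eye(d)
    cayley = np.linalg.solve(I - 0.5 * h * A, I + 0.5 * h * A)   # orthogonal step for dW/dt = W A

    W = np.eye(d)                                 # weight matrix evolving on O(d)
    x = rng.standard_normal(d)                    # hidden state of the main flow

    for _ in range(steps):
        W = W @ cayley                            # matrix flow stays on the orthogonal group
        x = x + h * np.tanh(W @ x)                # main flow dx/dt = tanh(W(t) x)

    print(np.allclose(W.T @ W, I))                # True: orthogonality preserved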
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
- Hyperbolic Neural Networks++ [66.16106727715061]
We generalize the fundamental components of neural networks in a single hyperbolic geometry model, namely, the Poincaré ball model.
Experiments show the superior parameter efficiency of our methods compared to conventional hyperbolic components, and stability and outperformance over their Euclidean counterparts.
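The basic operation behind such hyperbolic layers is Mobius addition in the Poincaré ball; a small helper (curvature fixed to 1, not this paper's implementation) shows that the ball is closed under it.

    import numpy as np

    def mobius_add(x, y, eps=1e-9):
        # Mobius addition in the Poincare ball with curvature c = 1
        xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
        num = (1 + 2 * xy + y2) * x + (1 - x2) * y
        return num / (1 + 2 * xy + x2 * y2 + eps)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5); x *= 0.4 / np.linalg.norm(x)   # points inside the unit ball
    y = rng.standard_normal(5); y *= 0.7 / np.linalg.norm(y)

    print(np.linalg.norm(mobius_add(x, y)) < 1.0)              # True: result stays in the ball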
arXiv Detail & Related papers (2020-06-15T08:23:20Z)