Geometry-aware training of factorized layers in tensor Tucker format
- URL: http://arxiv.org/abs/2305.19059v2
- Date: Mon, 14 Oct 2024 10:17:52 GMT
- Title: Geometry-aware training of factorized layers in tensor Tucker format
- Authors: Emanuele Zangrando, Steffen Schotthöfer, Gianluca Ceruti, Jonas Kusch, Francesco Tudisco
- Abstract summary: We introduce a novel approach to train the factors of a Tucker decomposition of the weight tensors.
Our training proposal proves to be optimal in locally approximating the original unfactorized dynamics.
We provide a theoretical analysis of the algorithm, showing convergence, approximation and local descent guarantees.
- Score: 6.701651480567394
- License:
- Abstract: Reducing parameter redundancies in neural network architectures is crucial for achieving feasible computational and memory requirements during training and inference phases. Given its easy implementation and flexibility, one promising approach is layer factorization, which reshapes weight tensors into a matrix format and parameterizes them as the product of two small-rank matrices. However, this approach typically requires an initial full-model warm-up phase, prior knowledge of a feasible rank, and is sensitive to parameter initialization. In this work, we introduce a novel approach to train the factors of a Tucker decomposition of the weight tensors. Our training proposal proves to be optimal in locally approximating the original unfactorized dynamics, independently of the initialization. Furthermore, the rank of each mode is dynamically updated during training. We provide a theoretical analysis of the algorithm, showing convergence, approximation and local descent guarantees. The method's performance is further illustrated through a variety of experiments, showing remarkable training compression rates and comparable or even better performance than the full baseline and alternative layer factorization strategies.
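As a rough illustration of the parameterization described above, the sketch below represents a weight tensor in Tucker format (a small core contracted with one factor matrix per mode) and shows how the rank of a single mode could be truncated through an SVD of the corresponding core unfolding. This is a minimal NumPy sketch with assumed shapes and a hypothetical truncation tolerance; it is not the paper's geometry-aware training algorithm, only the tensor format it operates on.

```python
import numpy as np

rng = np.random.default_rng(0)

def mode_product(T, U, mode):
    """Contract tensor T with matrix U along `mode`, i.e. compute U @ T_(mode)."""
    T = np.moveaxis(T, mode, 0)
    out = U @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((U.shape[0],) + T.shape[1:]), 0, mode)

def tucker_reconstruct(core, factors):
    """Rebuild the full weight tensor W = core x_1 U1 x_2 U2 x_3 U3."""
    W = core
    for mode, U in enumerate(factors):
        W = mode_product(W, U, mode)
    return W

def truncate_mode(core, factors, mode, tol=1e-2):
    """Hypothetical per-mode rank update: SVD of the mode unfolding of the core;
    singular values below tol * s_max are dropped, so only the core and the
    factor of that one mode change."""
    C = np.moveaxis(core, mode, 0)
    Uc, s, Vt = np.linalg.svd(C.reshape(C.shape[0], -1), full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))
    new_core = np.moveaxis(
        (np.diag(s[:r]) @ Vt[:r]).reshape((r,) + C.shape[1:]), 0, mode)
    new_factors = list(factors)
    new_factors[mode] = factors[mode] @ Uc[:, :r]
    return new_core, new_factors

# Example: a rank-(4, 4, 4) Tucker parameterization of a 16x16x16 weight tensor
# (all shapes and the tolerance are illustrative assumptions).
dims, ranks = (16, 16, 16), (4, 4, 4)
core = rng.standard_normal(ranks)
factors = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]

W = tucker_reconstruct(core, factors)
new_core, new_factors = truncate_mode(core, factors, mode=0, tol=1e-3)
W_trunc = tucker_reconstruct(new_core, new_factors)
print(W.shape, new_core.shape)
print(np.linalg.norm(W - W_trunc) / np.linalg.norm(W))  # small if little was cut
```

In the paper's setting the core and factors are the quantities actually being trained and the per-mode ranks are adapted during optimization; the toy truncation above only shows that changing the rank of one mode touches just the core and that mode's factor, which is what makes dynamic rank updates cheap in the Tucker format.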
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries [10.209740962369453]
Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging.
A promising alternative is shallow weight factorization, where weights are split into two factors, allowing for optimization of $L_2$-penalized neural networks (see the sketch after this list).
In this work, we introduce deep weight factorization, extending previous approaches to more than two factors.
arXiv Detail & Related papers (2025-02-04T17:12:56Z) - tCURLoRA: Tensor CUR Decomposition Based Low-Rank Parameter Adaptation and Its Application in Medical Image Segmentation [1.3281936946796913]
Transfer learning, by leveraging knowledge from pre-trained models, has significantly enhanced the performance of target tasks.
As deep neural networks scale up, full fine-tuning introduces substantial computational and storage challenges.
We propose tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition.
arXiv Detail & Related papers (2025-01-04T08:25:32Z) - Advancing Neural Network Performance through Emergence-Promoting Initialization Scheme [0.0]
Emergence in machine learning refers to the spontaneous appearance of capabilities that arise from the scale and structure of training data.
We introduce a novel yet straightforward neural network initialization scheme that aims at achieving greater potential for emergence.
We demonstrate substantial improvements in both model accuracy and training speed, with and without batch normalization.
arXiv Detail & Related papers (2024-07-26T18:56:47Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation have impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via a Polyak-Lojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - Defensive Tensorization [113.96183766922393]
We propose defensive tensorization, an adversarial defence technique that leverages a latent high-order factorization of the network.
We empirically demonstrate the effectiveness of our approach on standard image classification benchmarks.
We validate the versatility of our approach across domains and low-precision architectures by considering an audio task and binary networks.
arXiv Detail & Related papers (2021-10-26T17:00:16Z) - Initialization and Regularization of Factorized Neural Layers [23.875225732697142]
We show how to initialize and regularize factorized layers in deep nets.
We show how these schemes lead to improved performance on both translation and unsupervised pre-training.
arXiv Detail & Related papers (2021-05-03T17:28:07Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - A Multi-Scale Tensor Network Architecture for Classification and Regression [0.0]
We present an algorithm for supervised learning using tensor networks.
We preprocess the data by coarse-graining it through a sequence of wavelet transformations.
We show how fine-graining through the network may be used to initialize models with access to finer-scale features.
arXiv Detail & Related papers (2020-01-22T21:26:28Z)
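As a side note to the Deep Weight Factorization entry above, the following generic sketch (not code from that paper) numerically checks the classical fact underlying two-factor weight factorization: for w = u * v, the smallest possible value of (u^2 + v^2) / 2 over all factor pairs equals |w|, so an L2 penalty on the factors acts like an L1 penalty on the recomposed weight.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)

def min_l2_of_factors(wi, grid=np.linspace(0.05, 20.0, 20000)):
    """Smallest value of (u^2 + v^2) / 2 over factor pairs with u * v = wi,
    found by a brute-force scan over u."""
    if wi == 0.0:
        return 0.0
    u = np.sign(wi) * grid          # choose signs so that v = wi / u stays finite
    v = wi / u
    return float(np.min(0.5 * (u ** 2 + v ** 2)))

penalties = np.array([min_l2_of_factors(wi) for wi in w])
print(np.round(np.abs(w), 4))
print(np.round(penalties, 4))       # matches |w| up to grid resolution
```

This equivalence is one standard explanation for why factorized parameterizations trained with plain weight decay tend to drive the recomposed weights toward sparsity.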