projUNN: efficient method for training deep networks with unitary
matrices
- URL: http://arxiv.org/abs/2203.05483v2
- Date: Fri, 11 Mar 2022 20:12:45 GMT
- Title: projUNN: efficient method for training deep networks with unitary
matrices
- Authors: Bobak Kiani, Randall Balestriero, Yann Lecun, Seth Lloyd
- Abstract summary: We introduce two variants of a method that can parameterize full $N$-dimensional unitary or matrices with a training runtime scaling as $O(kN2)$.
Even in the fastest setting, projUNN is able to train a model's unitary parameters to reach comparable performances against baseline implementations.
- Score: 21.11571804661279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In learning with recurrent or very deep feed-forward networks, employing
unitary matrices in each layer can be very effective at maintaining long-range
stability. However, restricting network parameters to be unitary typically
comes at the cost of expensive parameterizations or increased training runtime.
We propose instead an efficient method based on rank-$k$ updates -- or their
rank-$k$ approximation -- that maintains performance at a nearly optimal
training runtime. We introduce two variants of this method, named Direct
(projUNN-D) and Tangent (projUNN-T) projected Unitary Neural Networks, that can
parameterize full $N$-dimensional unitary or orthogonal matrices with a
training runtime scaling as $O(kN^2)$. Our method either projects low-rank
gradients onto the closest unitary matrix (projUNN-T) or transports unitary
matrices in the direction of the low-rank gradient (projUNN-D). Even in the
fastest setting ($k=1$), projUNN is able to train a model's unitary parameters
to reach comparable performances against baseline implementations. By
integrating our projUNN algorithm into both recurrent and convolutional neural
networks, our models can closely match or exceed benchmarked results from
state-of-the-art algorithms.
Related papers
- Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes [3.246129789918632]
The training of deep neural networks is inherently a non- optimization problem.<n>Standard approaches such as gradient descent (SGD) require simultaneous updates to parameters.<n>We propose a novel method Alternating Train Miniization with tailored step sizes (SAMT)<n>SAMT achieves better performance with fewer parameter updates compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-08-06T08:23:38Z) - Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization [15.179519413549086]
We introduce a new accelerated $ell_p$ steepest descent algorithm, called Stacey, to handle non-Euclidean smooth optimization tasks.<n>In addition to providing theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against popular methods.
arXiv Detail & Related papers (2025-06-07T00:47:07Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball.
We propose a new family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
arXiv Detail & Related papers (2025-02-11T13:10:34Z) - Training Artificial Neural Networks by Coordinate Search Algorithm [0.20971479389679332]
We propose an efficient version of the gradient-free Coordinate Search (CS) algorithm for training neural networks.
The proposed algorithm can be used with non-differentiable activation functions and tailored to multi-objective/multi-loss problems.
Finding the optimal values for weights of ANNs is a large-scale optimization problem.
arXiv Detail & Related papers (2024-02-20T01:47:25Z) - A Mini-Block Natural Gradient Method for Deep Neural Networks [12.48022619079224]
We propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF)
Our novel approach utilizes the parallelism of generalization to efficiently perform on the large number of matrices in each layer.
arXiv Detail & Related papers (2022-02-08T20:01:48Z) - Algorithms for Efficiently Learning Low-Rank Neural Networks [12.916132936159713]
We study algorithms for learning low-rank neural networks.
We present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network.
We propose a novel low-rank framework for training low-rank $textitdeep$ networks.
arXiv Detail & Related papers (2022-02-02T01:08:29Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
deep equilibrium model is a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Can we learn gradients by Hamiltonian Neural Networks? [68.8204255655161]
We propose a meta-learner based on ODE neural networks that learns gradients.
We demonstrate that our method outperforms a meta-learner based on LSTM for an artificial task and the MNIST dataset with ReLU activations in the optimizee.
arXiv Detail & Related papers (2021-10-31T18:35:10Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Exploiting Adam-like Optimization Algorithms to Improve the Performance
of Convolutional Neural Networks [82.61182037130405]
gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam based variants based on the difference between the present and the past gradients.
We have tested ensemble of networks and the fusion with ResNet50 trained with gradient descent.
arXiv Detail & Related papers (2021-03-26T18:55:08Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study a distributed variable for large-scale AUC for a neural network as with a deep neural network.
Our model requires a much less number of communication rounds and still a number of communication rounds in theory.
Our experiments on several datasets show the effectiveness of our theory and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality
Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z) - Stochastic Flows and Geometric Optimization on the Orthogonal Group [52.50121190744979]
We present a new class of geometrically-driven optimization algorithms on the orthogonal group $O(d)$.
We show that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, flows and metric learning.
arXiv Detail & Related papers (2020-03-30T15:37:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.