Tensor-GaLore: Memory-Efficient Training via Gradient Tensor Decomposition
- URL: http://arxiv.org/abs/2501.02379v1
- Date: Sat, 04 Jan 2025 20:51:51 GMT
- Title: Tensor-GaLore: Memory-Efficient Training via Gradient Tensor Decomposition
- Authors: Robert Joseph George, David Pitt, Jiawei Zhao, Jean Kossaifi, Cheng Luo, Yuandong Tian, Anima Anandkumar
- Abstract summary: We present Tensor-GaLore, a novel method for efficient training of neural networks with higher-order tensor weights.
Across various PDE tasks, Tensor-GaLore achieves substantial memory savings, reducing optimizer memory usage by up to 75%.
- Score: 93.98343072306619
- Abstract: We present Tensor-GaLore, a novel method for efficient training of neural networks with higher-order tensor weights. Many models, particularly those used in scientific computing, employ tensor-parameterized layers to capture complex, multidimensional relationships. Scaling these methods to high-resolution problems makes memory usage grow intractably, and matrix-based optimization methods lead to suboptimal performance and compression. We propose to work directly in the high-order space of the complex tensor parameters, applying a tensor factorization to the gradients during optimization. We showcase its effectiveness on Fourier Neural Operators (FNOs), a class of models crucial for solving partial differential equations (PDEs), and provide theoretical guarantees for the method. Across various PDE tasks, such as the Navier-Stokes and Darcy flow equations, Tensor-GaLore achieves substantial memory savings, reducing optimizer memory usage by up to 75%. These substantial savings across AI-for-science workloads demonstrate Tensor-GaLore's potential.
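To make the core idea concrete, here is a minimal sketch of factorizing a gradient tensor with a Tucker decomposition and mapping the low-rank result back, using TensorLy; the ranks, shapes, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of gradient tensor factorization in the spirit of
# Tensor-GaLore; ranks, shapes, and function names are illustrative.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

def project_gradient(grad, rank):
    """Tucker-decompose the gradient; optimizer state (e.g. Adam
    moments) would live on the small core instead of the full tensor,
    which is the source of the memory saving."""
    core, factors = tucker(tl.tensor(grad), rank=rank)
    return core, factors

def project_back(core, factors):
    """Map the low-rank core update back to the full parameter space."""
    return tl.tucker_to_tensor((core, factors))

# Toy 4D gradient shaped like an FNO spectral weight block.
rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 16, 8, 8))
core, factors = project_gradient(grad, rank=[4, 4, 4, 4])
print(grad.size, core.size)  # full entries vs. core entries
recon = project_back(core, factors)
```

In this toy setting, moments kept on the core would occupy 256 entries instead of 16,384, the kind of optimizer-state reduction the abstract quantifies.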
Related papers
- SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization [0.5755004576310332]
SMMF is a memory-efficient optimizer that reduces the memory requirement of widely used adaptive learning rate optimizers, such as Adam, by up to 96%.
We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC.
In our experiments, SMMF uses up to 96% less memory than state-of-the-art memory-efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance.
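For intuition, a hedged sketch of the factorized-second-moment idea this line of work builds on, shown in its rank-1, Adafactor-style form; SMMF's square-matricization and exact update rule are not reproduced here.

```python
# Rank-1 factorization of a nonnegative second-moment matrix: store
# row sums r and column sums c instead of the full matrix, and
# reconstruct V ≈ r c^T / sum(V). Illustrative, not SMMF's update.
import numpy as np

def rank1_reconstruct(v):
    r = v.sum(axis=1, keepdims=True)   # O(n) memory
    c = v.sum(axis=0, keepdims=True)   # O(m) memory
    return r @ c / v.sum()

rng = np.random.default_rng(0)
g = rng.standard_normal((512, 256))
v = g * g                              # per-entry second moment
v_hat = rank1_reconstruct(v)           # O(n + m) state instead of O(nm)
step = g / (np.sqrt(v_hat) + 1e-8)     # Adam-style preconditioned step
```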
arXiv Detail & Related papers (2024-12-12T03:14:50Z) - Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation [10.48108719012248]
We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model.
In contrast to much of the previous work, we scale up the parameters of the student model during training.
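A minimal sketch of that over-parameterization pattern, assuming a simple two-factor (matrix) decomposition in PyTorch; the paper's tensor decomposition and training recipe are not reproduced here.

```python
# Illustrative over-parameterized student layer: the weight is kept as
# a product U @ V during training (the rank may exceed the layer's own
# dimensions), then merged into a plain Linear for deployment.
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    def __init__(self, in_f, out_f, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_f, rank) / rank**0.5)
        self.V = nn.Parameter(torch.randn(rank, in_f) / in_f**0.5)
        self.b = nn.Parameter(torch.zeros(out_f))

    def forward(self, x):
        return x @ (self.U @ self.V).T + self.b

    def merge(self):
        """Collapse the factors into a standard Linear after training."""
        lin = nn.Linear(self.V.shape[1], self.U.shape[0])
        with torch.no_grad():
            lin.weight.copy_(self.U @ self.V)
            lin.bias.copy_(self.b)
        return lin

layer = FactorizedLinear(128, 64, rank=256)  # rank > 64: over-parameterized
merged = layer.merge()                       # same function, compact form
```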
arXiv Detail & Related papers (2024-11-10T12:40:59Z) - Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs [93.82811501035569]
We introduce a new data-efficient and highly parallelizable operator learning approach with reduced memory requirements and better generalization.
MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena.
We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression.
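A small sketch of the weight-tensorization half of this idea, keeping a spectral weight tensor in Tucker form via TensorLy; the shapes and ranks are assumptions for illustration, and the multi-grid domain decomposition is not shown.

```python
# Parameter count for a Tucker-factorized FNO spectral weight versus
# its dense counterpart; ranks below are illustrative.
import numpy as np
import tensorly as tl
from tensorly.random import random_tucker

full_shape = (64, 64, 32, 32)  # (in_ch, out_ch, modes_x, modes_y)
core, factors = random_tucker(full_shape, rank=(16, 16, 8, 8), random_state=0)
w = tl.tucker_to_tensor((core, factors))  # reconstructed on the fly

n_full = int(np.prod(full_shape))
n_fact = core.size + sum(f.size for f in factors)
print(f"{n_full / n_fact:.0f}x fewer stored weights")
```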
arXiv Detail & Related papers (2023-09-29T20:18:52Z) - Geometry-Informed Neural Operator for Large-Scale 3D PDEs [76.06115572844882]
We propose the geometry-informed neural operator (GINO) to learn the solution operator of large-scale partial differential equations.
We successfully trained GINO to predict the pressure on car surfaces using only five hundred data points.
arXiv Detail & Related papers (2023-09-01T16:59:21Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for estimating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
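For context, a sketch of the classic column-row sampling estimator that this family refines; the winner-take-all selection from the paper is omitted, and all names here are illustrative.

```python
# Unbiased column-row sampling estimate of A @ B: sample k column/row
# pairs with probability proportional to their norms and rescale.
import numpy as np

def crs_matmul(A, B, k, rng):
    # p_i ∝ ||A[:, i]|| * ||B[i, :]|| is the variance-minimizing choice
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=k, p=p, replace=True)
    # rescale each sampled outer product by 1 / (k * p_i) for unbiasedness
    scale = 1.0 / (k * p[idx])
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 256)), rng.standard_normal((256, 32))
est = crs_matmul(A, B, k=64, rng=rng)
err = np.linalg.norm(est - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {err:.3f}")
```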
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Symbolic Regression on FPGAs for Fast Machine Learning Inference [2.0920303420933273]
The high-energy physics community is investigating the potential of deploying machine-learning-based solutions on Field-Programmable Gate Arrays (FPGAs).
We introduce a novel end-to-end procedure that utilizes a machine learning technique called symbolic regression (SR).
We show that our approach can approximate a 3-layer neural network using an inference model that achieves up to a 13-fold decrease in execution time, down to 5 ns, while still preserving more than 90% approximation accuracy.
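As a toy illustration of replacing a network with a cheap closed-form surrogate: real symbolic regression searches over expression structures, but here coefficients are simply fit over a fixed basis as a stand-in.

```python
# Fit a closed-form surrogate to a (stand-in) trained network's
# input/output behavior; the surrogate evaluates in a few FMAs.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(10_000, 1))
teacher = np.tanh(1.7 * x) + 0.3 * x**2         # stand-in for a trained NN

basis = np.hstack([x, x**2, x**3, np.tanh(x)])  # candidate terms
coef, *_ = np.linalg.lstsq(basis, teacher, rcond=None)
surrogate = basis @ coef
print(np.abs(surrogate - teacher).max())        # worst-case approx. error
```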
arXiv Detail & Related papers (2023-05-06T17:04:02Z) - Low-Rank Tensor Function Representation for Multi-Dimensional Data Recovery [52.21846313876592]
Low-rank tensor function representation (LRTFR) can continuously represent data beyond meshgrid with infinite resolution.
We develop two fundamental concepts for tensor functions, i.e., the tensor function rank and low-rank tensor function factorization.
Experiments substantiate the superiority and versatility of our method compared with state-of-the-art methods.
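A minimal sketch of what a low-rank tensor function can look like: CP-style factor functions parameterized by small MLPs, queryable at arbitrary continuous coordinates. The sizes and architecture are assumptions, not the paper's.

```python
# f(x, y, z) = sum_r fx(x)_r * fy(y)_r * fz(z)_r — a rank-r tensor
# function that can be evaluated off-grid, at any resolution.
import torch
import torch.nn as nn

def factor_net(rank):
    return nn.Sequential(nn.Linear(1, 64), nn.GELU(), nn.Linear(64, rank))

class LowRankTensorFunction(nn.Module):
    def __init__(self, rank=16):
        super().__init__()
        self.fx, self.fy, self.fz = (factor_net(rank) for _ in range(3))

    def forward(self, coords):               # coords: (N, 3)
        x, y, z = coords.split(1, dim=1)     # each (N, 1)
        return (self.fx(x) * self.fy(y) * self.fz(z)).sum(dim=1)

f = LowRankTensorFunction(rank=16)
pts = torch.rand(8, 3)                       # off-grid query points
print(f(pts).shape)                          # torch.Size([8])
```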
arXiv Detail & Related papers (2022-12-01T04:00:38Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
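To make the bottleneck concrete, here is a plain implicit-differentiation baseline in PyTorch: the backward pass runs its own iterative solve against the Jacobian, which is the cost SHINE targets by reusing the forward solver's inverse estimate. This is only the baseline, under illustrative assumptions; SHINE itself is not reimplemented here.

```python
# Baseline: backprop through a fixed point z* = f(z*, x) needs a solve
# with (I - df/dz)^T, done below by Neumann (fixed-point) iteration.
import torch

def fixed_point(f, x, z0, iters=50):
    """Forward pass: iterate z <- f(z, x) to (approximate) convergence."""
    z = z0
    for _ in range(iters):
        z = f(z, x)
    return z

def implicit_grad(f, x, z_star, grad_out, solver_iters=50):
    """Backward pass: solve v = grad_out + (df/dz)^T v, then return
    (df/dx)^T v. This inner solve is what SHINE amortizes."""
    z_star = z_star.detach().requires_grad_(True)
    fz = f(z_star, x)
    v = grad_out.clone()
    for _ in range(solver_iters):
        (jtv,) = torch.autograd.grad(fz, z_star, v, retain_graph=True)
        v = grad_out + jtv
    (gx,) = torch.autograd.grad(fz, x, v, retain_graph=True)
    return gx

f = lambda z, x: 0.5 * torch.tanh(z) + x      # contraction in z
x = torch.randn(4, requires_grad=True)
z_star = fixed_point(f, x, torch.zeros(4))
print(implicit_grad(f, x, z_star, torch.ones(4)))
```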
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Fourier Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
We formulate a new neural operator by parameterizing the integral kernel directly in Fourier space.
We perform experiments on Burgers' equation, Darcy flow, and Navier-Stokes equation.
The resulting operator is up to three orders of magnitude faster than traditional PDE solvers.
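A minimal 1D sketch of the Fourier layer this describes: FFT, a learned complex weight applied to the lowest modes, inverse FFT. Channel counts and the mode cutoff are illustrative.

```python
# Spectral convolution: keep only the lowest Fourier modes and mix
# channels there with a learned complex weight.
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.w = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )

    def forward(self, x):                       # x: (batch, channels, n)
        x_ft = torch.fft.rfft(x)                # (batch, channels, n//2 + 1)
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.w
        )
        return torch.fft.irfft(out_ft, n=x.size(-1))

layer = SpectralConv1d(channels=8, modes=16)
u = torch.randn(2, 8, 128)                      # batch of 1D fields
print(layer(u).shape)                           # torch.Size([2, 8, 128])
```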
arXiv Detail & Related papers (2020-10-18T00:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.