POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
- URL: http://arxiv.org/abs/2603.05500v1
- Date: Thu, 05 Mar 2026 18:59:23 GMT
- Title: POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
- Authors: Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu,
- Abstract summary: We introduce POET-X, a scalable and memory-efficient variant for training large language models.<n>PoET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency.
- Score: 57.57816409869894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.
Related papers
- Orthogonal Finetuning Made Scalable [92.34573849209238]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment.<n>We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity.<n>We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic.<n>These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without
arXiv Detail & Related papers (2025-06-24T17:59:49Z) - Scaling Probabilistic Circuits via Monarch Matrices [109.65822339230853]
Probabilistic Circuits (PCs) are tractable representations of probability distributions.<n>We propose a novel sparse and structured parameterization for the sum blocks in PCs.
arXiv Detail & Related papers (2025-06-14T07:39:15Z) - Reparameterized LLM Training via Orthogonal Equivalence Transformation [54.80172809738605]
We present POET, a novel training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons.<n>POET can stably optimize the objective function with improved generalization.<n>We develop efficient approximations that make POET flexible and scalable for training large-scale neural networks.
arXiv Detail & Related papers (2025-06-09T17:59:34Z) - SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization [0.5755004576310332]
SMMF is a memory-efficient that reduces the memory requirement of the widely used adaptive learning rate Matrix, such as Adam, by up to 96%.<n>We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate Matrix, such as AdamNC.<n>In our experiment, SMMF takes up to 96% less memory compared to state-of-the-art memory efficients, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance.
arXiv Detail & Related papers (2024-12-12T03:14:50Z) - Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores [3.6385567224218556]
Large language models (LLMs) have been widely applied but face challenges in efficient inference.
We introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization.
We implement an arbitrary precision matrix multiplication scheme that decomposes and recovers at the bit level, enabling flexible precision.
arXiv Detail & Related papers (2024-09-26T14:17:58Z) - Memory-Efficient Optimization with Factorized Hamiltonian Descent [11.01832755213396]
We introduce a novel adaptive, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge.
By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level.
We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees.
arXiv Detail & Related papers (2024-06-14T12:05:17Z) - Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR)
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Memory-Efficient Differentiable Programming for Quantum Optimal Control
of Discrete Lattices [1.5012666537539614]
Quantum optimal control problems are typically solved by gradient-based algorithms such as GRAPE.
QOC reveals that memory requirements are a barrier for simulating large models or long time spans.
We employ a nonstandard differentiable programming approach that significantly reduces the memory requirements at the cost of a reasonable amount of recomputation.
arXiv Detail & Related papers (2022-10-15T20:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.