Controlled LLM Training on Spectral Sphere
- URL: http://arxiv.org/abs/2601.08393v1
- Date: Tue, 13 Jan 2026 09:59:47 GMT
- Title: Controlled LLM Training on Spectral Sphere
- Authors: Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo
- Abstract summary: We introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. We observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
- Score: 76.60985966206746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbol{\mu}$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the \textbf{Spectral Sphere Optimizer (SSO)}, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbol{\mu}$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
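Reading the abstract operationally, a spectral-sphere step has two parts: control the spectral norm of the update (as Muon-style methods do), then retract the weights back onto a fixed-spectral-norm sphere so they cannot drift. The sketch below illustrates only that two-part idea; the paper's actual steepest-descent derivation and Megatron parallelization are not reproduced, the gradient normalization is a crude stand-in for Muon's orthogonalization, and `target_sigma` (which μP-style analyses would set per module, proportional to sqrt(fan_out/fan_in)) is an illustrative assumption.

```python
import numpy as np

def spectral_sphere_step(W, G, lr=0.02, target_sigma=1.0):
    """One hedged update on the spectral sphere ||W||_2 = target_sigma."""
    # Control the *update*: normalize the gradient to unit spectral norm
    # (a simplified stand-in for Muon-style orthogonalization).
    G_dir = G / (np.linalg.norm(G, 2) + 1e-12)
    W_new = W - lr * G_dir
    # Control the *weights*, the step the abstract says Muon lacks:
    # rescale so the largest singular value returns to target_sigma.
    W_new *= target_sigma / (np.linalg.norm(W_new, 2) + 1e-12)
    return W_new
```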
Related papers
- Enhanced Maximum Independent Set Preparation with Rydberg Atoms Guided by the Spectral Gap [4.082216579462797]
We introduce a spectral-gap-guided schedule engineering method that modifies the laser detuning profile to suppress leakage. We experimentally benchmark ADGLB on a quasi-one-dimensional chain of $N=10$ atoms. We show that the schedule optimized for smaller instances can be directly applied to larger two-dimensional triangular lattices with $N=25$ and $N=37$.
arXiv Detail & Related papers (2026-02-20T04:58:12Z)
- Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization [56.5199302532159]
We propose an Activation-guided Structured Regularization framework to suppress the negative effects of outliers. Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations. Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.
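The summary names the mechanism but not the objective. A minimal, hedged sketch of what an activation-guided structured penalty could look like; this is not Astro's actual loss, and `act_scale` (per-channel mean absolute activation from calibration data) and `lam` are illustrative assumptions.

```python
import torch

def activation_guided_penalty(weight, act_scale, lam=1e-4):
    # weight: (out_features, in_features); act_scale: (in_features,).
    # Emphasize input channels with high-magnitude activations, so their
    # weight columns are regularized harder and produce fewer outliers.
    channel_weights = act_scale / act_scale.mean()
    return lam * (channel_weights * weight.pow(2)).sum()
```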
arXiv Detail & Related papers (2026-02-07T15:50:18Z)
- Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations [55.047454145941366]
Streaming Merging is an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. ARM is a strategy designed to approximate gradient descent dynamics. ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model.
arXiv Detail & Related papers (2026-02-03T08:15:57Z)
- Towards a Principled Muon under $\mu\mathsf{P}$: Ensuring Spectral Conditions throughout Training [0.0]
We show how to reliably guarantee the spectral conditions required by $\mu$P for large language model (LLM) training. We develop a variant of Muon, namely Muon++, that satisfies the spectral conditions throughout the training process.
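For context, the "spectral conditions" in question are usually the μP requirement that a hidden weight matrix of shape (fan_out, fan_in) keep its spectral norm on the order of sqrt(fan_out/fan_in) throughout training. A small checker sketch under that assumption; the tolerance `tol` is illustrative, not from the paper.

```python
import numpy as np

def spectral_condition_ok(W, tol=3.0):
    # Spectral condition: ||W||_2 should stay Theta(sqrt(fan_out / fan_in)).
    fan_out, fan_in = W.shape
    target = np.sqrt(fan_out / fan_in)
    sigma = np.linalg.norm(W, 2)  # largest singular value
    return (sigma / target < tol) and (target / sigma < tol)
```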
arXiv Detail & Related papers (2026-01-04T00:04:05Z)
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations. For compute-optimal scaling, we find that scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across languages.
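A hedged sketch of the scaling rules this entry reports: under μP, hidden-layer learning rates for Adam-like optimizers shrink as 1/width when transferring from a tuned base width, and the entry finds 1/width scaling of independent weight decay nearly optimal. The function name and base values below are illustrative assumptions, not the paper's API.

```python
def mup_scaled_hparams(base_lr, base_wd, base_width, width):
    # Transfer hidden-layer LR and independent weight decay from a small
    # proxy model (base_width) to a wider model, both scaled as 1/width.
    scale = base_width / width
    return base_lr * scale, base_wd * scale

# Example: hyperparameters tuned at width 256, transferred to width 4096.
lr, wd = mup_scaled_hparams(base_lr=1e-2, base_wd=1e-1,
                            base_width=256, width=4096)
```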
arXiv Detail & Related papers (2025-12-05T11:03:41Z)
- Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training. We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z)
- The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long-sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting the computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
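The entry does not specify the compression criterion. As background, a generic modal-truncation heuristic for a diagonal LTI SSM (LRU-style) drops the state modes with the smallest steady-state energy contribution; the score below is assumed for illustration and may differ from CompreSSM's actual method.

```python
import numpy as np

def truncate_diagonal_ssm(a, B, C, keep):
    # Diagonal LTI SSM: x_{t+1} = a * x_t + B u_t, y_t = C x_t,
    # with a: (n,) complex poles (|a_i| < 1), B: (n, d_in), C: (d_out, n).
    # Energy proxy per mode: ||c_i|| * ||b_i|| / sqrt(1 - |a_i|^2).
    score = (np.linalg.norm(C, axis=0) * np.linalg.norm(B, axis=1)
             / np.sqrt(1.0 - np.abs(a) ** 2))
    idx = np.argsort(score)[-keep:]  # keep the top-`keep` modes
    return a[idx], B[idx, :], C[:, idx]
```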
arXiv Detail & Related papers (2025-10-03T09:02:33Z)
- SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training [22.230495941666096]
We introduce SlimPack, a framework that fundamentally rethinks data packing and scheduling by decomposing samples into fine-grained slices. SlimPack mitigates critical memory and communication bottlenecks by transforming large, volatile workloads into a stream of smaller, manageable units. Its Asymmetric Partitioning assembles balanced scheduling units uniquely optimized for the different demands of the forward and backward passes.
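A hedged sketch of the generic slice-then-pack idea the entry describes: decompose variable-length samples into fixed-size slices, then greedily bin-pack them into balanced units. SlimPack's actual asymmetric forward/backward partitioning is not modeled, and all parameters below are illustrative.

```python
def slice_and_pack(sample_lengths, slice_len=512, bin_capacity=4096):
    # Cut each sample into slices of at most slice_len tokens.
    slices = []
    for sid, n in enumerate(sample_lengths):
        for start in range(0, n, slice_len):
            slices.append((sid, min(slice_len, n - start)))
    # First-fit-decreasing bin packing into capacity-bounded units.
    bins, loads = [], []
    for item in sorted(slices, key=lambda s: -s[1]):
        for i, load in enumerate(loads):
            if load + item[1] <= bin_capacity:
                bins[i].append(item)
                loads[i] += item[1]
                break
        else:
            bins.append([item])
            loads.append(item[1])
    return bins
```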
arXiv Detail & Related papers (2025-09-30T13:37:48Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
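The summary does not define the quantizer, but a plain asymmetric ternary quantizer, with separate scales for the positive and negative levels, can be sketched as follows. The 0.75 * mean|w| threshold is a common ternarization heuristic assumed here, not taken from PT$^2$-LLM.

```python
import numpy as np

def ternarize_asymmetric(w):
    # Map weights to {-1, 0, +1} with independent positive/negative scales.
    thr = 0.75 * np.abs(w).mean()
    q = np.zeros_like(w)
    q[w > thr] = 1.0
    q[w < -thr] = -1.0
    # Per-level scales (the "asymmetric" part): mean magnitude of the
    # weights assigned to each nonzero level.
    alpha_pos = w[q == 1].mean() if (q == 1).any() else 0.0
    alpha_neg = -w[q == -1].mean() if (q == -1).any() else 0.0
    w_hat = np.where(q > 0, alpha_pos, np.where(q < 0, -alpha_neg, 0.0))
    return q, alpha_pos, alpha_neg, w_hat
```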
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - Hardware Co-Designed Optimal Control for Programmable Atomic Quantum Processors via Reinforcement Learning [0.18416014644193068]
We introduce a hardware co-designed quantum control framework to address inherent imperfections in classical control hardware. We demonstrate that the proposed framework enables robust, high-fidelity parallel single-qubit gate operations. We find that while PPO performance degrades as system complexity increases, the end-to-end differentiable RL consistently achieves gate fidelities above 99.9%.
arXiv Detail & Related papers (2025-04-16T03:30:40Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
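For background on what 1-bit weight quantization means concretely, here is a hedged sketch of plain per-row binarization, using the classical result that alpha = mean|w| is the optimal scale for sign codes. BiLLM's salient-weight handling and residual approximation, which push accuracy this far, are more elaborate and not reproduced here.

```python
import numpy as np

def binarize_rows(W):
    # Per-row scale alpha minimizing ||W - alpha * sign(W)||_F^2.
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    B = np.sign(W)
    return alpha * B  # dequantized 1-bit approximation of W
```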
arXiv Detail & Related papers (2024-02-06T09:26:34Z)