MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration
- URL: http://arxiv.org/abs/2602.01734v1
- Date: Mon, 02 Feb 2026 07:18:45 GMT
- Title: MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration
- Authors: Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong,
- Abstract summary: Training instability remains a critical challenge in large language model pretraining. We study training failures in a 5M-parameter NanoGPT model scaled via $μ$P. We propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank.
- Score: 48.446476072756276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via $μ$P, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
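The stable rank defined in the abstract (squared Frobenius norm over squared spectral norm) can be illustrated with a short NumPy sketch. This is not the authors' implementation: it assumes that "matrix sign" means the orthogonal polar factor U V^T obtained from a thin SVD, and the function names, the test matrix, and its dimensions are all hypothetical. The actual optimizer may rescale the result or use an iterative approximation instead of an SVD.

```python
import numpy as np


def stable_rank(w):
    """Squared Frobenius norm divided by squared spectral norm."""
    fro_sq = np.sum(w * w)
    spec = np.linalg.norm(w, ord=2)  # largest singular value
    return float(fro_sq / spec**2)


def matrix_sign(w):
    """Polar factor U @ Vt from a thin SVD (assumed meaning of 'matrix sign')."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)

# Hypothetical collapsed weight: one dominant rank-1 direction plus small noise.
w = 10.0 * np.outer(rng.standard_normal(64), rng.standard_normal(32))
w += 0.1 * rng.standard_normal((64, 32))

before = stable_rank(w)              # close to 1: a single direction dominates
after = stable_rank(matrix_sign(w))  # all singular values equal, so min(64, 32)

print(f"stable rank before: {before:.3f}, after: {after:.3f}")
```

Under this assumption, the sign operation sets every nonzero singular value to 1, so a full-rank matrix ends up with the largest stable rank its shape allows.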
Related papers
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens [38.425692691443764]
Existing Reinforcement Learning (RL) fine-tuning methods rely heavily on entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. We propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement.
arXiv Detail & Related papers (2026-02-17T14:46:48Z)
- AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents [75.67445299298949]
AgentCPM-Explore is a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 in five benchmarks.
arXiv Detail & Related papers (2026-02-06T08:24:59Z)
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z)
- Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning [19.22530791401551]
We introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset. A Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-03T06:59:42Z)
- HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-BENCH [11.643006508214887]
SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. We propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States.
arXiv Detail & Related papers (2026-01-28T05:03:24Z)
- M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization [9.358876832727239]
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs). We find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We introduce M-GRPO, a framework that leverages a slowly evolving momentum model to provide a stable training target. We also propose an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories.
arXiv Detail & Related papers (2025-12-15T08:07:23Z)
- Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the calibrated capacity to imagine and choose. We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
- MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics [72.00014675808228]
Instability in the evaluation of Large Language Models obscures true learning dynamics. We introduce MaP, a framework that integrates Merging and the Pass@k metric. Experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent rankings.
arXiv Detail & Related papers (2025-10-10T11:40:27Z)
- Overtrained Language Models Are Harder to Fine-Tune [64.44743256512237]
Large language models are pre-trained on ever-growing token budgets. We show that extended pre-training can make models harder to fine-tune, leading to degraded final performance.
arXiv Detail & Related papers (2025-03-24T23:11:56Z)
- Adaptive Epsilon Adversarial Training for Robust Gravitational Wave Parameter Estimation Using Normalizing Flows [2.4184866684341473]
Adversarial training with Normalizing Flow (NF) models is an emerging research area aimed at improving model robustness through adversarial samples. We propose an adaptive epsilon method for Fast Gradient Sign Method (FGSM) adversarial training, which dynamically adjusts perturbation strengths based on gradient magnitudes using logarithmic scaling. Our hybrid architecture, combining ResNet and Inverse Autoregressive Flow, reduces the Negative Log Likelihood loss by 47% under FGSM attacks compared to the baseline model. Under stronger Projected Gradient Descent attacks with perturbation strength of 0.05, our model maintains an NLL of 6.4, demonstrating superior robustness while avoiding
arXiv Detail & Related papers (2024-12-10T14:48:59Z)
- Stable Language Model Pre-training by Reducing Embedding Variability [29.698610741413045]
We explore Token Embedding Variability (TEV) as a proxy for assessing pre-training stability in language models.
We also propose Multi-head Low-Rank Attention (MLRA) as an architecture to alleviate such instability.
Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
arXiv Detail & Related papers (2024-09-12T06:37:46Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.