It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
- URL: http://arxiv.org/abs/2506.00486v3
- Date: Wed, 04 Jun 2025 08:00:08 GMT
- Title: It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
- Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han
- Abstract summary: Building on BackSlash, a training-time compression algorithm for large language models, we propose a unified, end-to-end framework for LLM optimization based on the generalized Gaussian (GG) model. Our contributions are threefold: a GG-based initialization scheme; DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile; and RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training.
- Score: 15.263422862969803
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters are better modeled by generalized Gaussian distributions (GGDs). By optimizing GG priors during training, BackSlash can reduce parameters by up to 90% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) a GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimized degradation in performance; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: https://huggingface.co/spaces/shifeng3711/gg_prior.
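As a rough illustration of the GG model the abstract refers to, the sketch below draws samples from a zero-mean generalized Gaussian with density f(x) = β / (2αΓ(1/β)) · exp(−(|x|/α)^β) and then recovers the shape parameter β by moment matching. It is not the paper's initialization scheme, DeepShape, or BackSlash; the function names (`sample_gg`, `estimate_shape`, `gg_moment_ratio`) and the values α = 0.02, β = 1.2 are assumptions chosen only for the example.

```python
import math
import random

def gg_moment_ratio(beta):
    """E[X^2] / (E|X|)^2 for a zero-mean generalized Gaussian with shape beta."""
    return math.exp(
        math.lgamma(1 / beta) + math.lgamma(3 / beta) - 2 * math.lgamma(2 / beta)
    )

def sample_gg(n, alpha, beta, rng):
    """Draw n samples with density proportional to exp(-(|x|/alpha)**beta)."""
    out = []
    for _ in range(n):
        # (|x|/alpha)**beta follows a Gamma(1/beta, 1) distribution.
        g = rng.gammavariate(1 / beta, 1.0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        out.append(sign * alpha * g ** (1 / beta))
    return out

def estimate_shape(xs, lo=0.2, hi=8.0, iters=60):
    """Moment-matching estimate of beta by bisection on the moment ratio."""
    m1 = sum(abs(x) for x in xs) / len(xs)
    m2 = sum(x * x for x in xs) / len(xs)
    target = m2 / (m1 * m1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gg_moment_ratio(mid) > target:
            lo = mid  # ratio too large, so beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical "weights": synthetic GG samples, not parameters of a real model.
rng = random.Random(0)
weights = sample_gg(50_000, alpha=0.02, beta=1.2, rng=rng)
beta_hat = estimate_shape(weights)
print(round(beta_hat, 2))
```

A shape of β = 2 corresponds to a Gaussian and β = 1 to a Laplace distribution, so an estimate below 2 indicates heavier tails than a Gaussian fit would assume. The moment ratio used here is scale-free, which is why α does not need to be estimated first.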
Related papers
- Layer-wise LoRA fine-tuning: a similarity metric approach [0.6323908398583081]
Low-Rank Adaptation (LoRA) techniques aim to reduce the computational cost of fine-tuning by freezing the pre-trained model and updating a smaller number of parameters.
We systematically select only a few layers to fine-tune using LoRA or its variants.
We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks.
arXiv Detail & Related papers (2026-02-05T18:38:53Z)
- POME: Post Optimization Model Edit via Muon-style Projection [74.73326657229347]
Post-Optimization Model Edit (POME) enhances the performance of fine-tuned large language models.
It applies a Muon-style projection to $\Delta W$, the difference between the fine-tuned and pretrained weights.
As a simple post-processing step, POME is completely decoupled from the training pipeline.
arXiv Detail & Related papers (2025-10-08T04:20:11Z)
- CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models [27.682531424487564]
Sparsity-aware training is an effective approach for transforming large language models into hardware-friendly sparse patterns.
We propose Continuous Adaptive Sparse Trainer (CAST), a continuous and differentiable sparsity-aware training framework for sparse models.
Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources.
arXiv Detail & Related papers (2025-09-30T09:28:47Z)
- Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models [0.41942958779358663]
We propose a predictive framework that models training dynamics and helps optimize resource usage.
We derive an empirical scaling law based on model size, initial performance, and training progress.
We find that training beyond a certain number of epochs offers little gain, suggesting earlier stopping can significantly reduce compute without sacrificing performance.
arXiv Detail & Related papers (2025-07-24T01:09:25Z)
- Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [65.6966065843227]
Iterative Reweight-then-Optimize (IRO) is a framework that performs RL-style alignment of a frozen base model without touching its parameters.
At test time, the value functions are used to guide the base model generation via a search-based optimization process.
Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT).
arXiv Detail & Related papers (2025-06-21T21:49:02Z)
- Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.
This paper addresses the question of how to optimally combine the model's predictions and the provided labels.
Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
- Shadow-FT: Tuning Instruct Model via Training on Paired Base Model [67.20706292627106]
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks.
We propose a novel Shadow-FT framework to tune the Instruct models by leveraging the corresponding Base models.
Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance.
arXiv Detail & Related papers (2025-05-19T05:16:21Z)
- Optimizing ML Training with Metagradient Descent [69.89631748402377]
We introduce an algorithm for efficiently calculating metagradients -- gradients through model training -- at scale.
We then introduce a "smooth model training" framework that enables effective optimization using metagradients.
arXiv Detail & Related papers (2025-03-17T22:18:24Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs).
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
arXiv Detail & Related papers (2024-06-24T08:43:21Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Improving generalization in large language models by learning prefix subspaces [5.911540700785975]
This article focuses on fine-tuning large language models (LLMs) in the scarce data regime (also known as the "few-shot" learning setting).
We propose a method to increase the generalization capabilities of LLMs based on neural network subspaces.
arXiv Detail & Related papers (2023-10-24T12:44:09Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.