IMU-1: Sample-Efficient Pre-training of Small Language Models
- URL: http://arxiv.org/abs/2602.02522v1
- Date: Sun, 25 Jan 2026 21:24:15 GMT
- Title: IMU-1: Sample-Efficient Pre-training of Small Language Models
- Authors: George Grigorev
- Abstract summary: We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA. We provide ablations for each component and release code, weights and data to enable reproduction: https://huggingface.co/thepowerfuldeez/imu1_base
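The abstract names its architectural interventions without detail; as a rough illustration of two of them, here is a minimal PyTorch sketch of QK-norm attention (RMS-normalizing queries and keys per head before the dot product) combined with a learned per-head output gate. Module structure, gate placement, and hyperparameters are assumptions for illustration, not the IMU-1 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with RMS-normalized queries/keys and a
    learned sigmoid gate per head. Illustrative sketch, not the IMU-1 code.
    Requires PyTorch >= 2.4 for nn.RMSNorm."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # QK-norm: normalize per head
        self.k_norm = nn.RMSNorm(self.head_dim)
        self.gate = nn.Parameter(torch.zeros(n_heads))  # per-head gating (init -> 0.5 after sigmoid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, head_dim); normalize q/k before attention.
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Sigmoid gate scales each head's output before the final projection.
        y = y * torch.sigmoid(self.gate).view(1, -1, 1, 1)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

x = torch.randn(2, 16, 256)
print(QKNormAttention(256, 8)(x).shape)  # torch.Size([2, 16, 256])
```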
Related papers
- Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain [0.0]
This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens; and (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning.
arXiv Detail & Related papers (2026-01-22T14:41:32Z)
- Diffusion Language Models are Super Data Learners [61.721441061210896]
When unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation.
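The "built-in Monte Carlo augmentation" comes from re-sampling a fresh masking ratio and pattern every time a sequence is revisited. A minimal sketch of one absorbing-state masked-diffusion training step, with an assumed `model` and `MASK_ID` (illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # assumed id of the [MASK] token
VOCAB = 1000

def diffusion_lm_step(model, tokens):
    """One masked-diffusion training step: sample a masking ratio t ~ U(0, 1),
    mask that fraction of positions, and train the model to denoise them.
    Each revisit re-draws the mask, acting as Monte Carlo data augmentation."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                        # per-sequence noise level
    mask = torch.rand(b, n) < t                 # any-order corruption
    noised = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noised)                      # denoiser (a bidirectional Transformer in practice)
    # Loss only on masked positions; the 1/t weighting is the standard
    # masked-diffusion (absorbing-state) ELBO weight.
    loss = F.cross_entropy(logits[mask], tokens[mask], reduction="none")
    weight = (1.0 / t.clamp(min=1e-3)).expand(b, n)[mask]
    return (weight * loss).mean()

# Toy usage with a stand-in "model": an embedding plus a linear head.
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB))
tokens = torch.randint(1, VOCAB, (4, 32))
print(diffusion_lm_step(model, tokens))
```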
arXiv Detail & Related papers (2025-11-05T08:17:42Z)
- CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models [27.682531424487564]
Sparsity-aware training is an effective approach for transforming large language models into hardware-friendly sparse patterns. We propose Continuous Adaptive Sparse Trainer (CAST), a continuous and differentiable sparsity-aware training framework for sparse models. Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources.
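The abstract does not spell out CAST's relaxation, so as a generic illustration of what sparsity-aware training means, here is a sketch of the common 2:4 semi-structured pattern trained with a straight-through estimator: the forward pass uses masked weights while the backward pass stays dense, so pruned weights can still recover. CAST's actual continuous relaxation differs in detail.

```python
import torch
import torch.nn as nn

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 (the 2:4
    semi-structured pattern accelerated by sparse tensor cores)."""
    groups = w.reshape(-1, 4)
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(1, idx, 1.0)
    return mask.reshape_as(w)

class SparseLinear(nn.Linear):
    """Linear layer trained sparsity-aware: the forward pass applies the 2:4
    mask, while a straight-through estimator keeps the backward pass dense.
    Generic sketch, not CAST's actual formulation."""

    def forward(self, x):
        mask = two_four_mask(self.weight.detach())
        # Straight-through: forward value is w * mask, but the gradient
        # w.r.t. w is the identity, so pruned entries still get updates.
        w_sparse = self.weight * mask + (self.weight - self.weight.detach()) * (1 - mask)
        return nn.functional.linear(x, w_sparse, self.bias)

layer = SparseLinear(64, 64)
layer(torch.randn(8, 64)).sum().backward()
print((two_four_mask(layer.weight) == 0).float().mean())  # ~0.5 sparsity
```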
arXiv Detail & Related papers (2025-09-30T09:28:47Z)
- DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training [28.02129783121819]
DreamPRM-1.5 is an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. It attains 84.6% accuracy on the MMMU validation set, 31.3% accuracy on R-Bench-V, and, when paired with a leading backbone, achieves first-place results on public multimodal reasoning leaderboards.
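As a toy illustration of instance-level reweighting via bi-level optimization (not DreamPRM-1.5's code), the sketch below alternates an inner model update on a weighted training loss with an outer update of per-example weight logits against a held-out meta loss, differentiating through a one-step lookahead:

```python
import torch

# Toy setup: linear model, per-example weights learned by bi-level optimization.
torch.manual_seed(0)
X, y = torch.randn(100, 8), torch.randn(100, 1)
X_meta, y_meta = torch.randn(20, 8), torch.randn(20, 1)

w = torch.zeros(8, 1, requires_grad=True)            # model parameters
logit_alpha = torch.zeros(100, requires_grad=True)   # per-instance weight logits
opt_w = torch.optim.SGD([w], lr=0.1)
opt_a = torch.optim.SGD([logit_alpha], lr=0.1)

for step in range(200):
    # Inner step: weighted training loss updates the model (weights frozen).
    alpha = torch.softmax(logit_alpha, dim=0).detach() * len(X)
    train_loss = (alpha * ((X @ w - y) ** 2).squeeze()).mean()
    opt_w.zero_grad(); train_loss.backward(); opt_w.step()

    # Outer step: differentiate the meta loss through one hypothetical
    # inner update to get a gradient on the instance weights.
    alpha = torch.softmax(logit_alpha, dim=0) * len(X)
    inner = (alpha * ((X @ w - y) ** 2).squeeze()).mean()
    grad_w, = torch.autograd.grad(inner, w, create_graph=True)
    w_virtual = w - 0.1 * grad_w                      # one-step lookahead
    meta_loss = ((X_meta @ w_virtual - y_meta) ** 2).mean()
    opt_a.zero_grad(); meta_loss.backward(); opt_a.step()

print(f"final meta loss: {meta_loss.item():.4f}")
```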
arXiv Detail & Related papers (2025-09-05T23:42:01Z)
- MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining [60.02032710118597]
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini.
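A Multi-Token Prediction objective adds auxiliary heads that predict tokens two or more steps ahead from the same hidden state. The sketch below shows the basic loss with assumed shapes and names; MiMo's actual MTP modules are more elaborate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Multi-token prediction: besides the usual next-token head, extra
    heads predict tokens 1, 2, ..., n_future steps ahead from the same
    hidden state. Sketch with assumed shapes, not MiMo's implementation."""

    def __init__(self, dim: int, vocab: int, n_future: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, dim); tokens: (batch, time)
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # only positions with a k-ahead target
            target = tokens[:, k:]          # the token k steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return loss / len(self.heads)

hidden = torch.randn(2, 16, 64)
tokens = torch.randint(0, 1000, (2, 16))
print(MTPHeads(64, 1000)(hidden, tokens))
```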
arXiv Detail & Related papers (2025-05-12T14:30:11Z)
- Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of the most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
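The "dynamic sparse model structures" refer to learned token-to-expert routing; as a generic sketch (not Pangu's Ascend-specific code), here is a top-k router with the Switch-Transformer-style auxiliary load-balancing loss that keeps expert usage, and therefore hardware utilization, roughly even:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Route each token to its top-k experts; the auxiliary loss pushes the
    router toward balanced expert usage, which is what makes MoE compute
    utilization predictable on real hardware. Generic sketch."""

    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, experts)
        top_p, top_idx = probs.topk(self.k, dim=-1)    # chosen experts per token
        # Load-balancing loss (Switch-Transformer style): product of the
        # fraction of tokens dispatched to each expert and its mean gate prob.
        n_experts = probs.size(-1)
        dispatch = F.one_hot(top_idx, n_experts).float().sum(1)  # (tokens, experts)
        aux_loss = n_experts * (dispatch.mean(0) * probs.mean(0)).sum()
        return top_idx, top_p, aux_loss

router = TopKRouter(dim=64, n_experts=8, k=2)
idx, weights, aux = router(torch.randn(32, 64))
print(idx.shape, weights.shape, aux.item())
```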
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
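METRO sits in the ELECTRA line of work: an auxiliary generator fills masked positions with plausible tokens, and the main model learns from the corrupted sequence. The sketch below shows only a replaced-token-detection variant of that idea with stand-in models; METRO's full objective is richer:

```python
import torch
import torch.nn.functional as F

def metro_style_step(generator, main_model, tokens, mask_id=0, mask_rate=0.15):
    """One denoising step with model-generated signals (ELECTRA-style):
    the auxiliary generator proposes replacements at masked positions and
    the main model is trained to detect them. Schematic only."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_rate
    masked = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    # Auxiliary model fills the masks by sampling from its own distribution.
    with torch.no_grad():
        gen_logits = generator(masked)
        samples = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, samples, tokens)

    # Main model predicts, per position, whether the token was replaced.
    replaced = (corrupted != tokens).float()
    logits = main_model(corrupted).squeeze(-1)   # (batch, time) scores
    return F.binary_cross_entropy_with_logits(logits, replaced)

# Toy usage with stand-in "models".
vocab = 100
generator = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))
main_model = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, 1))
tokens = torch.randint(1, vocab, (4, 24))
print(metro_style_step(generator, main_model, tokens))
```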
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
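LogME scores a checkpoint by the log marginal evidence of the target labels under a Bayesian linear model fit on the frozen features, maximized over two precision hyperparameters. A condensed NumPy sketch of that fixed-point computation (the released implementation is more careful numerically):

```python
import numpy as np

def logme(features: np.ndarray, labels: np.ndarray, iters: int = 50) -> float:
    """Simplified LogME (You et al., 2021): log marginal evidence of the
    labels under a Bayesian linear model on frozen features, maximized over
    prior precision alpha and noise precision beta by fixed-point iteration."""
    n, d = features.shape
    u, s, _ = np.linalg.svd(features, full_matrices=False)
    sigma = s ** 2
    z = u.T @ labels                    # label projections onto singular directions
    z2, y2 = z ** 2, (labels ** 2).sum()
    alpha, beta = 1.0, 1.0
    for _ in range(iters):
        denom = alpha + beta * sigma
        m2 = (beta ** 2 * sigma * z2 / denom ** 2).sum()            # ||posterior mean||^2
        res = y2 - z2.sum() + (alpha ** 2 * z2 / denom ** 2).sum()  # residual ||y - Fm||^2
        gamma = (beta * sigma / denom).sum()                        # effective dimensions
        alpha = gamma / (m2 + 1e-12)
        beta = (n - gamma) / (res + 1e-12)
    denom = alpha + beta * sigma
    m2 = (beta ** 2 * sigma * z2 / denom ** 2).sum()
    res = y2 - z2.sum() + (alpha ** 2 * z2 / denom ** 2).sum()
    evidence = 0.5 * (d * np.log(alpha) + n * np.log(beta) - beta * res
                      - alpha * m2 - np.log(denom).sum() - n * np.log(2 * np.pi))
    return evidence / n

# Toy check: features that linearly explain the labels score higher.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
good = feats @ rng.normal(size=32) + 0.1 * rng.normal(size=500)
print(logme(feats, good), logme(feats, rng.normal(size=500)))
```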
arXiv Detail & Related papers (2021-02-22T13:58:11Z)