IMU-1: Sample-Efficient Pre-training of Small Language Models
- URL: http://arxiv.org/abs/2602.02522v1
- Date: Sun, 25 Jan 2026 21:24:15 GMT
- Title: IMU-1: Sample-Efficient Pre-training of Small Language Models
- Authors: George Grigorev
- Abstract summary: We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA. We provide ablations for each component and release code, weights and data to enable reproduction: https://huggingface.co/thepowerfuldeez/imu1_base
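The abstract names its architectural interventions without detail; as a rough illustration of two of them, here is a minimal PyTorch sketch of QK-norm attention (RMS-normalizing queries and keys per head before the dot product) combined with a learned per-head output gate. Module structure, gate placement, and hyperparameters are assumptions for illustration, not the IMU-1 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with RMS-normalized queries/keys and a
    learned sigmoid gate per head. Illustrative sketch, not the IMU-1 code.
    Requires PyTorch >= 2.4 for nn.RMSNorm."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # QK-norm: normalize per head
        self.k_norm = nn.RMSNorm(self.head_dim)
        self.gate = nn.Parameter(torch.zeros(n_heads))  # per-head gating (init -> 0.5 after sigmoid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, head_dim); normalize q/k before attention.
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Sigmoid gate scales each head's output before the final projection.
        y = y * torch.sigmoid(self.gate).view(1, -1, 1, 1)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

x = torch.randn(2, 16, 256)
print(QKNormAttention(256, 8)(x).shape)  # torch.Size([2, 16, 256])
```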
Related papers
- Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain [0.0]
This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens; and (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning.
arXiv Detail & Related papers (2026-01-22T14:41:32Z)
- Diffusion Language Models are Super Data Learners [61.721441061210896]
When unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation.
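The "built-in Monte Carlo augmentation" comes from re-sampling a fresh masking ratio and pattern every time a sequence is revisited. A minimal sketch of one absorbing-state masked-diffusion training step, with an assumed `model` and `MASK_ID` (illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # assumed id of the [MASK] token
VOCAB = 1000

def diffusion_lm_step(model, tokens):
    """One masked-diffusion training step: sample a masking ratio t ~ U(0, 1),
    mask that fraction of positions, and train the model to denoise them.
    Each revisit re-draws the mask, acting as Monte Carlo data augmentation."""
    b, n = tokens.shape
    t = torch.rand(b, 1)                        # per-sequence noise level
    mask = torch.rand(b, n) < t                 # any-order corruption
    noised = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noised)                      # denoiser (a bidirectional Transformer in practice)
    # Loss only on masked positions; the 1/t weighting is the standard
    # masked-diffusion (absorbing-state) ELBO weight.
    loss = F.cross_entropy(logits[mask], tokens[mask], reduction="none")
    weight = (1.0 / t.clamp(min=1e-3)).expand(b, n)[mask]
    return (weight * loss).mean()

# Toy usage with a stand-in "model": an embedding plus a linear head.
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB))
tokens = torch.randint(1, VOCAB, (4, 32))
print(diffusion_lm_step(model, tokens))
```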
arXiv Detail & Related papers (2025-11-05T08:17:42Z)
- CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models [27.682531424487564]
Sparsity-aware training is an effective approach for transforming large language models into hardware-friendly sparse patterns. We propose Continuous Adaptive Sparse Trainer (CAST), a continuous and differentiable sparsity-aware training framework for sparse models. Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources.
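The abstract does not spell out CAST's relaxation, so as a generic illustration of what sparsity-aware training means, here is a sketch of the common 2:4 semi-structured pattern trained with a straight-through estimator: the forward pass uses masked weights while the backward pass stays dense, so pruned weights can still recover. CAST's actual continuous relaxation differs in detail.

```python
import torch
import torch.nn as nn

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 (the 2:4
    semi-structured pattern accelerated by sparse tensor cores)."""
    groups = w.reshape(-1, 4)
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(1, idx, 1.0)
    return mask.reshape_as(w)

class SparseLinear(nn.Linear):
    """Linear layer trained sparsity-aware: the forward pass applies the 2:4
    mask, while a straight-through estimator keeps the backward pass dense.
    Generic sketch, not CAST's actual formulation."""

    def forward(self, x):
        mask = two_four_mask(self.weight.detach())
        # Straight-through: forward value is w * mask, but the gradient
        # w.r.t. w is the identity, so pruned entries still get updates.
        w_sparse = self.weight * mask + (self.weight - self.weight.detach()) * (1 - mask)
        return nn.functional.linear(x, w_sparse, self.bias)

layer = SparseLinear(64, 64)
layer(torch.randn(8, 64)).sum().backward()
print((two_four_mask(layer.weight) == 0).float().mean())  # ~0.5 sparsity
```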
arXiv Detail & Related papers (2025-09-30T09:28:47Z)
- DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training [28.02129783121819]
DreamPRM-1.5 is an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. It attains 84.6% accuracy on the MMMU validation set, 31.3% accuracy on R-Bench-V, and, when paired with a leading backbone, achieves first-place results on public multimodal reasoning leaderboards.
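As a toy illustration of instance-level reweighting via bi-level optimization (not DreamPRM-1.5's code), the sketch below alternates an inner model update on a weighted training loss with an outer update of per-example weight logits against a held-out meta loss, differentiating through a one-step lookahead:

```python
import torch

# Toy setup: linear model, per-example weights learned by bi-level optimization.
torch.manual_seed(0)
X, y = torch.randn(100, 8), torch.randn(100, 1)
X_meta, y_meta = torch.randn(20, 8), torch.randn(20, 1)

w = torch.zeros(8, 1, requires_grad=True)            # model parameters
logit_alpha = torch.zeros(100, requires_grad=True)   # per-instance weight logits
opt_w = torch.optim.SGD([w], lr=0.1)
opt_a = torch.optim.SGD([logit_alpha], lr=0.1)

for step in range(200):
    # Inner step: weighted training loss updates the model (weights frozen).
    alpha = torch.softmax(logit_alpha, dim=0).detach() * len(X)
    train_loss = (alpha * ((X @ w - y) ** 2).squeeze()).mean()
    opt_w.zero_grad(); train_loss.backward(); opt_w.step()

    # Outer step: differentiate the meta loss through one hypothetical
    # inner update to get a gradient on the instance weights.
    alpha = torch.softmax(logit_alpha, dim=0) * len(X)
    inner = (alpha * ((X @ w - y) ** 2).squeeze()).mean()
    grad_w, = torch.autograd.grad(inner, w, create_graph=True)
    w_virtual = w - 0.1 * grad_w                      # one-step lookahead
    meta_loss = ((X_meta @ w_virtual - y_meta) ** 2).mean()
    opt_a.zero_grad(); meta_loss.backward(); opt_a.step()

print(f"final meta loss: {meta_loss.item():.4f}")
```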
arXiv Detail & Related papers (2025-09-05T23:42:01Z)
- MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining [60.02032710118597]
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini.
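A Multi-Token Prediction objective adds auxiliary heads that predict tokens two or more steps ahead from the same hidden state. The sketch below shows the basic loss with assumed shapes and names; MiMo's actual MTP modules are more elaborate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Multi-token prediction: besides the usual next-token head, extra
    heads predict tokens 1, 2, ..., n_future steps ahead from the same
    hidden state. Sketch with assumed shapes, not MiMo's implementation."""

    def __init__(self, dim: int, vocab: int, n_future: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, dim); tokens: (batch, time)
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # only positions with a k-ahead target
            target = tokens[:, k:]          # the token k steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return loss / len(self.heads)

hidden = torch.randn(2, 16, 64)
tokens = torch.randint(0, 1000, (2, 16))
print(MTPHeads(64, 1000)(hidden, tokens))
```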
arXiv Detail & Related papers (2025-05-12T14:30:11Z)
- Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of the most capable language models. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
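The "dynamic sparse model structures" refer to learned token-to-expert routing; as a generic sketch (not Pangu's Ascend-specific code), here is a top-k router with the Switch-Transformer-style auxiliary load-balancing loss that keeps expert usage, and therefore hardware utilization, roughly even:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Route each token to its top-k experts; the auxiliary loss pushes the
    router toward balanced expert usage, which is what makes MoE compute
    utilization predictable on real hardware. Generic sketch."""

    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)        # (tokens, experts)
        top_p, top_idx = probs.topk(self.k, dim=-1)    # chosen experts per token
        # Load-balancing loss (Switch-Transformer style): product of the
        # fraction of tokens dispatched to each expert and its mean gate prob.
        n_experts = probs.size(-1)
        dispatch = F.one_hot(top_idx, n_experts).float().sum(1)  # (tokens, experts)
        aux_loss = n_experts * (dispatch.mean(0) * probs.mean(0)).sum()
        return top_idx, top_p, aux_loss

router = TopKRouter(dim=64, n_experts=8, k=2)
idx, weights, aux = router(torch.randn(32, 64))
print(idx.shape, weights.shape, aux.item())
```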
arXiv Detail & Related papers (2025-05-07T15:46:36Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
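METRO sits in the ELECTRA line of work: an auxiliary generator fills masked positions with plausible tokens, and the main model learns from the corrupted sequence. The sketch below shows only a replaced-token-detection variant of that idea with stand-in models; METRO's full objective is richer:

```python
import torch
import torch.nn.functional as F

def metro_style_step(generator, main_model, tokens, mask_id=0, mask_rate=0.15):
    """One denoising step with model-generated signals (ELECTRA-style):
    the auxiliary generator proposes replacements at masked positions and
    the main model is trained to detect them. Schematic only."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_rate
    masked = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    # Auxiliary model fills the masks by sampling from its own distribution.
    with torch.no_grad():
        gen_logits = generator(masked)
        samples = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, samples, tokens)

    # Main model predicts, per position, whether the token was replaced.
    replaced = (corrupted != tokens).float()
    logits = main_model(corrupted).squeeze(-1)   # (batch, time) scores
    return F.binary_cross_entropy_with_logits(logits, replaced)

# Toy usage with stand-in "models".
vocab = 100
generator = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))
main_model = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, 1))
tokens = torch.randint(1, vocab, (4, 24))
print(metro_style_step(generator, main_model, tokens))
```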
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
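LogME scores a checkpoint by the log marginal evidence of the target labels under a Bayesian linear model fit on the frozen features, maximized over two precision hyperparameters. A condensed NumPy sketch of that fixed-point computation (the released implementation is more careful numerically):

```python
import numpy as np

def logme(features: np.ndarray, labels: np.ndarray, iters: int = 50) -> float:
    """Simplified LogME (You et al., 2021): log marginal evidence of the
    labels under a Bayesian linear model on frozen features, maximized over
    prior precision alpha and noise precision beta by fixed-point iteration."""
    n, d = features.shape
    u, s, _ = np.linalg.svd(features, full_matrices=False)
    sigma = s ** 2
    z = u.T @ labels                    # label projections onto singular directions
    z2, y2 = z ** 2, (labels ** 2).sum()
    alpha, beta = 1.0, 1.0
    for _ in range(iters):
        denom = alpha + beta * sigma
        m2 = (beta ** 2 * sigma * z2 / denom ** 2).sum()            # ||posterior mean||^2
        res = y2 - z2.sum() + (alpha ** 2 * z2 / denom ** 2).sum()  # residual ||y - Fm||^2
        gamma = (beta * sigma / denom).sum()                        # effective dimensions
        alpha = gamma / (m2 + 1e-12)
        beta = (n - gamma) / (res + 1e-12)
    denom = alpha + beta * sigma
    m2 = (beta ** 2 * sigma * z2 / denom ** 2).sum()
    res = y2 - z2.sum() + (alpha ** 2 * z2 / denom ** 2).sum()
    evidence = 0.5 * (d * np.log(alpha) + n * np.log(beta) - beta * res
                      - alpha * m2 - np.log(denom).sum() - n * np.log(2 * np.pi))
    return evidence / n

# Toy check: features that linearly explain the labels score higher.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
good = feats @ rng.normal(size=32) + 0.1 * rng.normal(size=500)
print(logme(feats, good), logme(feats, rng.normal(size=500)))
```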
arXiv Detail & Related papers (2021-02-22T13:58:11Z)