Motif 2 12.7B technical report
- URL: http://arxiv.org/abs/2511.07464v1
- Date: Wed, 12 Nov 2025 01:01:00 GMT
- Title: Motif 2 12.7B technical report
- Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon,
- Abstract summary: We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains. Post-training employs a three-stage supervised fine-tuning pipeline that enhances general instruction adherence, compositional understanding, and linguistic precision.
- Score: 8.084150960631142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
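Two of the mechanisms named above can be made concrete with short sketches. First, Grouped Differential Attention: the abstract does not spell out its exact formulation, so the following is a rough PyTorch-style sketch of the differential-attention idea it builds on (two softmax maps, one acting as a learned noise estimate that is subtracted from the signal map), with a hypothetical unequal allocation of heads between the signal and noise-control groups. The head counts, the lambda value, and the repetition of the smaller noise group across signal heads are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def differential_attention(q1, k1, q2, k2, v, lam=0.5):
        # Signal map minus a scaled noise-control map, applied to the values.
        # q*, k*, v have shape (batch, heads, seq, d_head).
        d = q1.size(-1)
        a_signal = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
        a_noise = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
        return (a_signal - lam * a_noise) @ v

    # Hypothetical grouped allocation: 6 signal heads, 2 noise-control heads;
    # the smaller noise group is repeated so each signal head has a noise map.
    batch, seq, n_signal, n_noise, d_head = 2, 16, 6, 2, 64
    q1 = torch.randn(batch, n_signal, seq, d_head)
    k1 = torch.randn(batch, n_signal, seq, d_head)
    v = torch.randn(batch, n_signal, seq, d_head)
    q2 = torch.randn(batch, n_noise, seq, d_head).repeat_interleave(n_signal // n_noise, dim=1)
    k2 = torch.randn(batch, n_noise, seq, d_head).repeat_interleave(n_signal // n_noise, dim=1)
    out = differential_attention(q1, k1, q2, k2, v)  # (batch, n_signal, seq, d_head)

Second, the curriculum-driven data scheduler is described only as gradually changing the data composition ratio over training. A minimal sketch, assuming simple linear interpolation between an initial and a final domain mix (the domain names, weights, and schedule shape are hypothetical, not taken from the paper):

    def mixture_at_step(step, total_steps, start_mix, end_mix):
        # Linearly interpolate per-domain sampling weights, then renormalize.
        t = min(max(step / total_steps, 0.0), 1.0)
        mix = {d: (1 - t) * start_mix[d] + t * end_mix[d] for d in start_mix}
        z = sum(mix.values())
        return {d: w / z for d, w in mix.items()}

    # Hypothetical mixes: shift weight from web text toward code and math late in training.
    start = {"web": 0.70, "code": 0.15, "math": 0.05, "science": 0.10}
    end = {"web": 0.40, "code": 0.30, "math": 0.15, "science": 0.15}
    print(mixture_at_step(1_000_000, 2_000_000, start, end))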
Related papers
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. Existing methods suffer from computational inefficiency and objective mismatches between training and inference. We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z)
- Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes [7.998815625852598]
We introduce a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations. We detail a robust Reinforcement Learning Fine-Tuning pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse.
arXiv Detail & Related papers (2025-12-11T00:51:18Z)
- Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models [49.911784762244814]
TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models (DLMs). We derive a series of state-of-the-art diffusion language models, namely TraDo. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
- PLaMo 2 Technical Report [9.166942912957724]
We introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture. PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.
arXiv Detail & Related papers (2025-09-05T08:17:59Z)
- Bielik v3 Small: Technical Report [0.0]
We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts.
arXiv Detail & Related papers (2025-05-05T10:39:51Z)
- Bielik 11B v2 Technical Report [0.0]
Bielik 11B v2 is a state-of-the-art language model optimized for Polish text processing. It is built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss and Adaptive Learning Rate.
arXiv Detail & Related papers (2025-05-05T07:03:41Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance. Existing direct preference learning algorithms are originally designed for the single-turn chat task. We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions (a rough sketch of this idea follows the list below).
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
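The GroupBERT entry above describes adding a convolutional module alongside self-attention so that local and global interactions are learned separately. The block below is a rough illustrative sketch of that idea, not the paper's actual architecture; the depthwise convolution, kernel size, model width, and residual/LayerNorm placement are all assumptions made for the example.

    import torch
    import torch.nn as nn

    class ConvComplementBlock(nn.Module):
        # Self-attention handles global token interactions; a depthwise
        # convolution branch handles local interactions.
        def __init__(self, d_model=256, n_heads=4, kernel_size=7):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                                  padding=kernel_size // 2, groups=d_model)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            a, _ = self.attn(x, x, x)  # global interactions
            x = self.norm1(x + a)
            c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local interactions
            return self.norm2(x + c)

    block = ConvComplementBlock()
    y = block(torch.randn(2, 32, 256))  # -> (2, 32, 256)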