Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
- URL: http://arxiv.org/abs/2511.06221v1
- Date: Sun, 09 Nov 2025 04:37:36 GMT
- Title: Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
- Authors: Sen Xu, Yi Zhou, Wei Wang, Jixin Min, Zhibin Yin, Yingwei Dai, Shixi Liu, Lianyu Pang, Yirong Chen, Junlin Zhang,
- Abstract summary: This report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). This counters the dominant approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs a Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7), a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium's 50.3 and its base model's 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
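The abstract names MaxEnt-Guided Policy Optimization without detailing it. As a rough illustration of what an entropy-guided weighting could look like, the minimal sketch below scales each problem's group-relative advantages by the entropy of its empirical pass rate, so problems the model solves about half the time drive the update. The binary-entropy weight, the group-mean baseline, and all function names here are our assumptions, not the paper's published algorithm.

```python
import math
from typing import List

def binary_entropy(p: float) -> float:
    """Entropy of a Bernoulli(p) outcome, in nats; maximal at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def maxent_weighted_advantages(rewards: List[float]) -> List[float]:
    """Illustrative MaxEnt-style weighting (an assumption, not the paper's
    published algorithm): score each rollout against the group-mean
    baseline, then scale the whole group by the entropy of its empirical
    pass rate, so problems solved ~50% of the time dominate the update.
    `rewards` holds one binary correctness score per sampled rollout."""
    pass_rate = sum(rewards) / len(rewards)
    weight = binary_entropy(pass_rate)
    baseline = pass_rate  # group-mean baseline, GRPO-style
    return [weight * (r - baseline) for r in rewards]

# Example: a problem solved in 4 of 8 rollouts gets full weight;
# one solved 8 of 8 contributes nothing (entropy = 0).
print(maxent_weighted_advantages([1, 0, 1, 0, 1, 1, 0, 0]))
print(maxent_weighted_advantages([1] * 8))
```

In a real RL loop such a weight would multiply the policy-gradient advantages before the update; it is shown here in isolation.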
Related papers
- Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models [0.0]
We present a controlled study of multi-hop contextual reasoning in large language models. We show that multi-agent systems exhibit the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail.
arXiv Detail & Related papers (2026-01-06T20:18:55Z) - Skywork Open Reasoner 1 Technical Report [51.403686909760914]
We present Skywork-OR1, an effective and scalable reinforcement learning (RL) implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks.
arXiv Detail & Related papers (2025-05-28T12:56:04Z) - Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning [231.11339402237903]
We introduce Seed1.5-Thinking, a model that reasons through thinking before responding. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces, and 77.3 on GPQA. It demonstrates excellent reasoning abilities in STEM and coding.
arXiv Detail & Related papers (2025-04-10T17:10:51Z) - An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.52239241349504]
Scaling RL training has become a central technique for implementing such reasoning models. We demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models. We also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models.
arXiv Detail & Related papers (2025-03-06T15:34:27Z) - S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z) - Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning [2.1036545320600095]
We evaluate DeepSeek-R1, an open-source reasoning model, against OpenAI's GPT-4o and GPT-4o-mini. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots.
arXiv Detail & Related papers (2025-02-03T07:17:46Z) - SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs [0.0]
We present SmolTulu, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model. Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios.
arXiv Detail & Related papers (2024-12-11T12:41:36Z) - GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing (a generic routing sketch appears after this list).
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing (a dispatch sketch follows the list below).
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
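The GRIN summary above names its mechanism, sparse gradient estimation for expert routing, without detailing it. Below is a minimal PyTorch sketch of a generic straight-through top-1 router, a common stand-in for letting gradients reach a discrete dispatch decision; it is illustrative only and is not the estimator proposed in the GRIN paper.

```python
import torch
import torch.nn.functional as F

def top1_route_straight_through(logits: torch.Tensor) -> torch.Tensor:
    """Hard top-1 expert dispatch whose backward pass flows through the
    softmax (straight-through estimator). A generic stand-in for
    gradient-informed routing; NOT the estimator proposed in GRIN.
    logits: [tokens, num_experts] router scores."""
    probs = F.softmax(logits, dim=-1)
    hard = F.one_hot(probs.argmax(dim=-1), logits.size(-1)).to(probs.dtype)
    # Value equals the hard one-hot mask; gradient is that of `probs`.
    return hard + probs - probs.detach()

# Toy check: the router receives gradients despite discrete dispatch.
logits = torch.randn(4, 8, requires_grad=True)   # 4 tokens, 8 experts
expert_out = torch.randn(4, 8)                   # stand-in expert outputs
loss = (top1_route_straight_through(logits) * expert_out).sum()
loss.backward()
print(logits.grad.abs().sum() > 0)               # tensor(True)
```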
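The expert-prototyping summary describes splitting experts into $k$ prototype groups with top-1 routing inside each. Here is a minimal sketch of that dispatch pattern; how the selected experts' outputs are combined is not stated in the summary, so the multi-hot mask below is our illustrative reading.

```python
import torch
import torch.nn.functional as F

def prototype_top1_dispatch(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Split `num_experts` router logits into k equal prototype groups and
    take a hard top-1 inside each, so k experts fire per token ("k top-1"
    routing). Illustrative assumption, not the paper's exact formulation.
    logits: [tokens, num_experts]; num_experts must be divisible by k.
    Returns a multi-hot [tokens, num_experts] dispatch mask."""
    tokens, num_experts = logits.shape
    groups = logits.view(tokens, k, num_experts // k)  # [tokens, k, group]
    winners = groups.argmax(dim=-1)                    # top-1 per prototype
    mask = F.one_hot(winners, num_experts // k)        # [tokens, k, group]
    return mask.view(tokens, num_experts).to(logits.dtype)

# Example: 16 experts split into k=4 prototypes -> 4 expert calls per
# token, so per-token compute stays constant as the expert pool grows.
mask = prototype_top1_dispatch(torch.randn(2, 16), k=4)
print(mask.sum(dim=-1))  # tensor([4., 4.])
```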