Compass-Thinker-7B Technical Report
- URL: http://arxiv.org/abs/2508.08909v2
- Date: Thu, 14 Aug 2025 07:12:38 GMT
- Title: Compass-Thinker-7B Technical Report
- Authors: Anxiang Zeng, Haibo Zhang, Kaixiang Mo, Long Zhang, Shuman Liu, Yanhui Huang, Yawen Liu, Yuepeng Sheng, Yuwei Huang,
- Abstract summary: We propose the Compass-Thinker-7B model to explore the potential of Reinforcement Learning with fewer computational resources and lower costs. Compass-Thinker-7B is trained from an open-source model through a specially designed Reinforcement Learning pipeline. We show that Compass-Thinker-7B possesses exceptional reasoning potential and achieves superior performance on mathematics compared to RL models of the same size.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent R1-Zero-like research further demonstrates that reasoning extension gives large language models (LLMs) unprecedented reasoning capabilities, with Reinforcement Learning (RL) as the core technology for eliciting complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We propose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learning with fewer computational resources and lower costs, and to provide insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open-source model through a specially designed Reinforcement Learning pipeline. We curate a dataset of 30k verifiable mathematics problems for the pipeline. By configuring data and training settings with different difficulty distributions for different stages, the model's potential is gradually released and training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential and achieves superior performance on mathematics compared to RL models of the same size. In particular, on the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy.
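Two concrete ingredients stand out in the abstract: a verifiable-reward signal over the 30k mathematics problems, and stage-wise difficulty scheduling. Below is a minimal Python sketch of how such a pipeline could be wired, assuming a boxed-answer reward rule and a three-bucket difficulty split; the bucket names, stage mix, and matching rule are illustrative assumptions, not the report's actual recipe.

```python
import random
import re
from fractions import Fraction

def verifiable_reward(completion: str, gold: str) -> float:
    """Binary outcome reward for a verifiable math problem: 1.0 iff the
    final \\boxed{...} answer matches the reference exactly or numerically."""
    m = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if m is None:
        return 0.0
    pred = m.group(1).strip()
    if pred == gold.strip():
        return 1.0
    try:  # numeric fallback, e.g. "0.5" vs "1/2"
        return 1.0 if Fraction(pred) == Fraction(gold) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

# Hypothetical stage-wise schedule: later stages shift probability mass
# toward harder buckets, gradually releasing the model's potential.
STAGES = [
    {"steps": 300, "mix": {"easy": 0.6, "medium": 0.3, "hard": 0.1}},
    {"steps": 300, "mix": {"easy": 0.2, "medium": 0.5, "hard": 0.3}},
    {"steps": 400, "mix": {"easy": 0.1, "medium": 0.3, "hard": 0.6}},
]

def sample_batch(pool: dict, stage: dict, batch_size: int = 32) -> list:
    """Draw problems so the batch difficulty mix follows the stage config."""
    buckets = list(stage["mix"])
    weights = [stage["mix"][b] for b in buckets]
    chosen = random.choices(buckets, weights=weights, k=batch_size)
    return [random.choice(pool[b]) for b in chosen]
```

The binary reward is what makes the data "verifiable": no learned reward model is needed, only an answer checker.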
Related papers
- RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Stratified Rubric-Based Curriculum Learning (RuCL) is a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. Experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model.
arXiv Detail & Related papers (2026-02-25T06:46:24Z)
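RuCL's shift from data selection to reward design suggests rewards built from rubric tiers gated by competence. A hypothetical sketch under that reading; the tier criteria, thresholds, and judge interface are all assumptions.

```python
# Rubric criteria grouped into tiers of increasing demand; the active
# tier is chosen from the model's current competence (e.g. pass rate).
RUBRIC_TIERS = {
    0: ["identifies the relevant objects in the image"],
    1: ["identifies the relevant objects", "states intermediate quantities"],
    2: ["identifies the relevant objects", "states intermediate quantities",
        "justifies each inference step"],
}

def pick_tier(pass_rate: float) -> int:
    """Map competence on a probe set to a rubric tier."""
    return 0 if pass_rate < 0.3 else 1 if pass_rate < 0.7 else 2

def rubric_reward(satisfied: set, pass_rate: float) -> float:
    """Fraction of the active tier's criteria the response satisfies;
    `satisfied` stands in for a judge model's per-criterion verdicts."""
    criteria = RUBRIC_TIERS[pick_tier(pass_rate)]
    return sum(c in satisfied for c in criteria) / len(criteria)
```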
- D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use
Large reasoning models (LRMs) lack sub-task decomposition capability in complex tool-use scenarios, leading to Lazy Reasoning. We propose a two-stage training framework that incentivizes LRMs' task-decomposition reasoning capability via self-distillation and diversity-aware reinforcement learning. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales.
arXiv Detail & Related papers (2026-02-02T14:36:15Z)
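The second stage's "diversity-aware reinforcement learning" plausibly rewards rollouts whose sub-task decompositions differ within a sampling group. A hedged sketch of one such reward; the `(success, plan)` rollout format and the `alpha` bonus weight are assumptions.

```python
from collections import Counter

def diversity_aware_rewards(rollouts, alpha=0.2):
    """rollouts: list of (success: bool, plan: tuple of sub-task names).
    Rarer decompositions within the group earn a small bonus on top of
    the task-success reward."""
    plan_counts = Counter(plan for _, plan in rollouts)
    n = len(rollouts)
    rewards = []
    for success, plan in rollouts:
        base = 1.0 if success else 0.0
        rarity = 1.0 - plan_counts[plan] / n  # rarer plans score higher
        rewards.append(base + alpha * rarity)
    return rewards

rollouts = [(True, ("search", "filter", "book")),
            (True, ("search", "filter", "book")),
            (False, ("search", "book"))]
print(diversity_aware_rewards(rollouts))
```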
- Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
We show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that stabilizes early learning with minimal supervision.
arXiv Detail & Related papers (2026-01-29T13:31:28Z)
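The method name points to GRPO-style training; the group-relative advantage at GRPO's core is shown below, and dropping the learned value network is exactly what makes such recipes budget-efficient. The paper's specific early-learning stabilization is not reproduced here.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its sampling group:
    no critic network is needed, only multiple samples per prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```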
- Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models (DLMs). We derive a series of state-of-the-art diffusion language models, namely TraDo. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
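"Trajectory-aware" for a diffusion LM suggests credit assignment over the denoising trajectory rather than left-to-right token order. A speculative sketch, assuming a masked-diffusion decoder that commits token positions step by step; TraceRL's actual objective may differ substantially.

```python
def trajectory_credits(reward: float, steps: list, seq_len: int) -> list:
    """Spread a sequence-level reward over a diffusion decoding trajectory:
    steps[t] lists the token positions unmasked at denoising step t, and
    each step's equal share is split among the tokens it committed."""
    credits = [0.0] * seq_len
    per_step = reward / len(steps)
    for positions in steps:
        for pos in positions:
            credits[pos] = per_step / len(positions)
    return credits

# e.g. an 8-token answer decoded in three steps, sequence reward 1.0
print(trajectory_credits(1.0, [[0, 4], [1, 2, 5], [3, 6, 7]], 8))
```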
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
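"Synergizing internal exploitation with external data" reads as hybrid-policy batches: the model's own rollouts mixed with external demonstrations, so learning is not capped at what the base model can already sample. A minimal sketch of the batch composition only; the mixing ratio is a placeholder, and RL-PLUS's actual objective handles the off-policy data far more carefully.

```python
import random

def hybrid_batch(on_policy, external, batch_size=64, external_frac=0.25):
    """Compose a training batch from on-policy rollouts plus a slice of
    external data (assumes both pools are at least batch-sized)."""
    k = int(batch_size * external_frac)
    batch = random.sample(external, k) + random.sample(on_policy, batch_size - k)
    random.shuffle(batch)
    return batch
```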
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
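The "no RL signals" phrasing has a precise meaning under group-relative schemes: a prompt whose sampled rollouts all fail (or all succeed) has zero reward variance, hence zero advantage everywhere. A tiny illustration:

```python
def has_rl_signal(group_rewards: list) -> bool:
    """A prompt only produces gradient signal if its sampled rollouts
    disagree; all-0 or all-1 groups are wasted compute."""
    return len(set(group_rewards)) > 1

assert not has_rl_signal([0.0, 0.0, 0.0, 0.0])  # too hard: no signal
assert has_rl_signal([0.0, 1.0, 0.0, 1.0])      # mixed: useful prompt
```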
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
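The proposed recipe is simple enough to state as a schedule: RL on math-only prompts first, then on code-only prompts. A sketch, where the file names, reward identifiers, step counts, and `trainer` interface are all placeholders.

```python
# Hypothetical two-stage curriculum config in the spirit of the paper:
# math-only RL with an answer-match reward, then code-only RL with a
# unit-test reward.
CURRICULUM = [
    {"stage": "math-rl", "prompts": "math_only.jsonl",
     "reward": "answer_match", "steps": 2000},
    {"stage": "code-rl", "prompts": "code_only.jsonl",
     "reward": "unit_tests_pass", "steps": 1500},
]

def run(trainer):
    """`trainer` is an assumed interface, not a real library object."""
    for stage in CURRICULUM:
        trainer.train(prompt_file=stage["prompts"],
                      reward_fn=stage["reward"], steps=stage["steps"])
```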
- Concise Reasoning via Reinforcement Learning
We revisit the core principles of reinforcement learning (RL). We uncover a natural correlation between conciseness and accuracy that has been largely overlooked. We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly reduce chains of thought.
arXiv Detail & Related papers (2025-04-07T15:35:54Z)
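One explicit way to encode the conciseness-accuracy trade-off is a reward in which correctness dominates and, among correct completions, shorter chains of thought score marginally higher. A hedged sketch: the shaping form and `beta` are assumptions, and the paper itself attributes the shortening to a secondary RL phase on a small problem set rather than to an explicit length bonus.

```python
def concise_reward(correct: bool, n_tokens: int, max_tokens: int = 4096,
                   beta: float = 0.1) -> float:
    """Correctness is worth 1.0; a small bonus (at most beta) favors
    shorter correct completions. Incorrect completions earn nothing."""
    if not correct:
        return 0.0
    return 1.0 + beta * (1.0 - min(n_tokens, max_tokens) / max_tokens)
```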
- An Empirical Study on Eliciting and Improving R1-like Reasoning Models
Scaling RL training has become a central technique for implementing such reasoning models. We demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models. We also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models.
arXiv Detail & Related papers (2025-03-06T15:34:27Z)
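A common shape for tool manipulation during reasoning: when the model emits a fenced code block, execute it and append the output to the context before continuing generation. A hedged sketch; the paper's actual tool setup and the `generate` interface here are assumptions.

```python
import re
import subprocess
import sys

FENCE = r"`{3}python\n(.*?)`{3}"  # matches fenced python blocks

def run_with_tool(generate, prompt: str, max_rounds: int = 4) -> str:
    """Alternate model generation with code execution: each emitted
    python block is run and its output fed back into the context."""
    context = prompt
    for _ in range(max_rounds):
        out = generate(context)          # hypothetical model interface
        context += out
        blocks = re.findall(FENCE, out, re.DOTALL)
        if not blocks:
            break                        # no tool call: reasoning finished
        # May raise subprocess.TimeoutExpired on runaway code.
        result = subprocess.run([sys.executable, "-c", blocks[-1]],
                                capture_output=True, text=True, timeout=10)
        context += "\n[tool output]\n" + (result.stdout or result.stderr)
    return context
```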
- On the Diminishing Returns of Complex Robust RAG Training in the Era of Powerful LLMs
We investigate the question: does the benefit of complex robust training methods diminish as language models become more powerful? Our analysis reveals a consistent trend: the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes.
arXiv Detail & Related papers (2025-02-17T03:34:31Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
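In the pure outcome-reward setting OREAL studies, only a binary verifier signal is available; one simple positive-side baseline is best-of-N sampling followed by imitation of the verified-correct samples. A sketch under a hypothetical `policy_sample`/`verify` interface; OREAL's full method goes further, e.g. in how it handles negative samples.

```python
def bon_positive_set(policy_sample, verify, prompt, n=16):
    """Collect verified-correct completions out of n samples; under a
    binary outcome reward these can serve as behavior-cloning targets."""
    samples = [policy_sample(prompt) for _ in range(n)]
    return [s for s in samples if verify(prompt, s)]
```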