Test-Time Scaling with Reflective Generative Model
- URL: http://arxiv.org/abs/2507.01951v2
- Date: Wed, 09 Jul 2025 12:28:31 GMT
- Title: Test-Time Scaling with Reflective Generative Model
- Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
- Abstract summary: We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters.
- Score: 56.681092287505564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Elimination of the reliance on process-level annotation: we provide a self-supervised process reward model, which learns high-quality reasoning trajectory selection directly from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
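As a rough illustration of the unified policy/PRM interface described above, the sketch below shows one backbone feeding two task-specific heads. The class name, hidden sizes, and the MLP scoring head are assumptions for illustration, not the released MetaStone-S1 implementation:

```python
import torch
import torch.nn as nn

class ReflectiveGenerativeModel(nn.Module):
    """Sketch: a policy and a process reward model sharing one backbone.

    The backbone plus the LM head act as the policy (trajectory
    prediction); a small extra scoring head rates each position, so
    process reward modeling adds only a few parameters on top of the
    policy (the paper reports 53M extra parameters at 32B scale).
    """

    def __init__(self, backbone: nn.Module, hidden: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                      # shared transformer trunk
        self.lm_head = nn.Linear(hidden, vocab_size)  # policy head: next-token logits
        self.score_head = nn.Sequential(              # PRM head: scalar score per position
            nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, hidden_inputs: torch.Tensor):
        h = self.backbone(hidden_inputs)              # (batch, seq, hidden)
        logits = self.lm_head(h)                      # trajectory prediction
        scores = torch.sigmoid(self.score_head(h)).squeeze(-1)  # trajectory scoring
        return logits, scores
```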
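And a hedged sketch of how such per-position scores could drive the test-time scaling the abstract describes, with more candidates sampled at higher effort modes. The helper `sample_trajectory`, the candidate counts per mode, and the geometric-mean aggregation are all illustrative guesses, not the paper's exact procedure:

```python
import torch

def best_of_n(model, prompt_ids: torch.Tensor, effort: str = "medium"):
    """Keep the candidate trajectory with the highest aggregated PRM score.

    Higher effort modes sample more candidates; the self-supervised PRM
    head ranks them without needing process-level labels at inference.
    """
    n = {"low": 2, "medium": 8, "high": 32}[effort]   # illustrative counts only
    best_traj, best_score = None, float("-inf")
    for _ in range(n):
        traj = sample_trajectory(model, prompt_ids)   # hypothetical decoding loop
        _, step_scores = model(traj)                  # per-position PRM scores
        score = step_scores.clamp_min(1e-6).log().mean().item()  # geometric mean
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj
```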
Related papers
- Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [65.6966065843227]
Iterative Reweight-then-Optimize (IRO) is a framework that performs RL-style alignment of a frozen base model without touching its parameters. At test time, the value functions are used to guide the base model's generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT).
arXiv Detail & Related papers (2025-06-21T21:49:02Z) - R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [32.06393076572057]
AgentGym is the largest procedurally-curated executable gym environment for training real-world SWE-agents. It is powered by two main contributions: SYNGEN, a synthetic data curation recipe, and Hybrid Test-time Scaling. Overall, our approach achieves 51% on the SWE-Bench Verified benchmark, reflecting a new state-of-the-art for open-weight SWE-agents.
arXiv Detail & Related papers (2025-04-09T17:55:19Z) - GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning [35.429904556288996]
We introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification. Experimental results show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset.
arXiv Detail & Related papers (2025-04-01T15:21:05Z) - Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
Generalist Virtual Agents (GVAs) have shown significant promise in autonomous task execution. To address the challenges that remain in training and evaluating them, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model. Similar offers fine-grained signals for agent training and can choose better actions for inference-time scaling.
arXiv Detail & Related papers (2025-03-24T13:30:47Z) - START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-Thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z) - S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z) - PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection [65.84604846389624]
We propose PointOBB-v3, a stronger single point-supervised oriented object detection (OOD) framework. It generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. Our method achieves an average accuracy improvement of 3.56% over previous state-of-the-art methods.
arXiv Detail & Related papers (2025-01-23T18:18:15Z) - A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis of several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z) - Data-Driven Approaches for Modelling Target Behaviour [1.5495593104596401]
The performance of tracking algorithms depends on the chosen model assumptions regarding the target dynamics.
This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion.
arXiv Detail & Related papers (2024-10-14T14:18:27Z) - Guided Self-attention: Find the Generalized Necessarily Distinct Vectors for Grain Size Grading [11.220653004059304]
We propose a novel classification method based on deep learning, namely GSNets, which can effectively introduce guided self-attention for classifying grain size.
Experiments show that our GSNet yields a classification accuracy of 90.1%, surpassing the state-of-the-art Swin Transformer V2 by 1.9%.
We intuitively believe our approach is applicable to broader applications like object detection and semantic segmentation.
arXiv Detail & Related papers (2024-10-08T07:40:31Z) - Subequivariant Graph Reinforcement Learning in 3D Environments [34.875774768800966]
We propose a novel setup for morphology-agnostic RL, dubbed Subequivariant Graph RL in 3D environments.
Specifically, we first introduce a new set of more practical yet challenging benchmarks in 3D space.
To optimize the policy over the enlarged state-action space, we propose to inject geometric symmetry.
arXiv Detail & Related papers (2023-05-30T11:34:57Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To
Benchmark [14.50261153230204]
We focus on Multimodal Machine Comprehension (M3C), where a model is expected to answer questions based on a given passage (or context).
We identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models.
We propose a systematic framework to address these biases through three Control-Knobs.
arXiv Detail & Related papers (2021-10-22T16:33:57Z)