Test-Time Scaling with Reflective Generative Model
- URL: http://arxiv.org/abs/2507.01951v2
- Date: Wed, 09 Jul 2025 12:28:31 GMT
- Title: Test-Time Scaling with Reflective Generative Model
- Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
- Abstract summary: We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters.
- Score: 56.681092287505564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Elimination of the reliance on process-level annotation: we provide a self-supervised process reward model, which learns high-quality reasoning trajectory selection directly from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
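As a rough illustration of the unified policy/PRM interface described above, the sketch below shows one backbone feeding two task-specific heads. The class name, hidden sizes, and the MLP scoring head are assumptions for illustration, not the released MetaStone-S1 implementation:

```python
import torch
import torch.nn as nn

class ReflectiveGenerativeModel(nn.Module):
    """Sketch: a policy and a process reward model sharing one backbone.

    The backbone plus the LM head act as the policy (trajectory
    prediction); a small extra scoring head rates each position, so
    process reward modeling adds only a few parameters on top of the
    policy (the paper reports 53M extra parameters at 32B scale).
    """

    def __init__(self, backbone: nn.Module, hidden: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                      # shared transformer trunk
        self.lm_head = nn.Linear(hidden, vocab_size)  # policy head: next-token logits
        self.score_head = nn.Sequential(              # PRM head: scalar score per position
            nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, hidden_inputs: torch.Tensor):
        h = self.backbone(hidden_inputs)              # (batch, seq, hidden)
        logits = self.lm_head(h)                      # trajectory prediction
        scores = torch.sigmoid(self.score_head(h)).squeeze(-1)  # trajectory scoring
        return logits, scores
```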
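And a hedged sketch of how such per-position scores could drive the test-time scaling the abstract describes, with more candidates sampled at higher effort modes. The helper `sample_trajectory`, the candidate counts per mode, and the geometric-mean aggregation are all illustrative guesses, not the paper's exact procedure:

```python
import torch

def best_of_n(model, prompt_ids: torch.Tensor, effort: str = "medium"):
    """Keep the candidate trajectory with the highest aggregated PRM score.

    Higher effort modes sample more candidates; the self-supervised PRM
    head ranks them without needing process-level labels at inference.
    """
    n = {"low": 2, "medium": 8, "high": 32}[effort]   # illustrative counts only
    best_traj, best_score = None, float("-inf")
    for _ in range(n):
        traj = sample_trajectory(model, prompt_ids)   # hypothetical decoding loop
        _, step_scores = model(traj)                  # per-position PRM scores
        score = step_scores.clamp_min(1e-6).log().mean().item()  # geometric mean
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj
```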
Related papers
- Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [65.6966065843227]
Iterative Reweight-then-Optimize (IRO) is a framework that performs RL-style alignment of a frozen base model without touching its parameters. At test time, the value functions are used to guide the base model's generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT).
arXiv Detail & Related papers (2025-06-21T21:49:02Z) - R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [32.06393076572057]
AgentGym is the largest procedurally-curated executable gym environment for training real-world SWE-agents. It is powered by two main contributions: SYNGEN, a synthetic data curation recipe, and Hybrid Test-time Scaling. Overall, our approach achieves 51% on the SWE-Bench Verified benchmark, reflecting a new state-of-the-art for open-weight SWE-agents.
arXiv Detail & Related papers (2025-04-09T17:55:19Z) - GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning [35.429904556288996]
We introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification. Experimental results show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset.
arXiv Detail & Related papers (2025-04-01T15:21:05Z) - Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
Generalist Virtual Agents (GVAs) have shown significant promise in autonomous task execution. To address the challenges that remain in training and evaluating them, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model. Similar offers fine-grained signals for agent training and can choose better actions for inference-time scaling.
arXiv Detail & Related papers (2025-03-24T13:30:47Z) - START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-Thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z) - S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z) - PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection [65.84604846389624]
We propose PointOBB-v3, a stronger single point-supervised oriented object detection (OOD) framework. It generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. Our method achieves an average accuracy improvement of 3.56% over previous state-of-the-art methods.
arXiv Detail & Related papers (2025-01-23T18:18:15Z) - A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis of several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z) - Data-Driven Approaches for Modelling Target Behaviour [1.5495593104596401]
The performance of tracking algorithms depends on the chosen model assumptions regarding the target dynamics.
This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion.
arXiv Detail & Related papers (2024-10-14T14:18:27Z) - Guided Self-attention: Find the Generalized Necessarily Distinct Vectors for Grain Size Grading [11.220653004059304]
We propose a novel classification method based on deep learning, namely GSNets, which can effectively introduce guided self-attention for classifying grain size.
Experiments show that our GSNet yields a classification accuracy of 90.1%, surpassing the state-of-the-art Swin Transformer V2 by 1.9%.
We intuitively believe our approach is applicable to broader applications like object detection and semantic segmentation.
arXiv Detail & Related papers (2024-10-08T07:40:31Z) - Subequivariant Graph Reinforcement Learning in 3D Environments [34.875774768800966]
We propose a novel setup for morphology-agnostic RL, dubbed Subequivariant Graph RL in 3D environments.
Specifically, we first introduce a new set of more practical yet challenging benchmarks in 3D space.
To optimize the policy over the enlarged state-action space, we propose to inject geometric symmetry.
arXiv Detail & Related papers (2023-05-30T11:34:57Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To
Benchmark [14.50261153230204]
We focus on Multimodal Machine Comprehension (M3C), where a model is expected to answer questions based on a given passage (or context).
We identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models.
We propose a systematic framework to address these biases through three Control-Knobs.
arXiv Detail & Related papers (2021-10-22T16:33:57Z)