Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
- URL: http://arxiv.org/abs/2506.09532v2
- Date: Tue, 23 Sep 2025 05:57:54 GMT
- Title: Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
- Authors: Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum,
- Abstract summary: We present Athena-PRM, a process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems.<n>Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios.
- Score: 19.924422958846144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning and outperforms baseline with a significant margin on five benchmarks.
Related papers
- CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent [46.41047559759938]
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces.<n> Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored.<n>We present CUARewardBench, comprising four key contributions.
arXiv Detail & Related papers (2025-10-21T12:53:40Z) - From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision [49.59309446816251]
Existing methods estimate the quality of reasoning steps based on a fixed-budget sampling strategy.<n>We propose Adaptive Monte Carlo Search (AMCS), a framework that transforms data generation from fixed, static to adaptive.<n>AMCS adaptively refines estimation by allocating more samples to uncertain reasoning steps while using fewer samples for those that are easier to estimate.
arXiv Detail & Related papers (2025-09-29T06:52:35Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces.<n>ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data.<n>Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs)<n>Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations.<n>Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning [25.817231106021552]
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks.<n>However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning.<n>In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning.
arXiv Detail & Related papers (2025-04-21T17:59:02Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.<n>We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - ViLBench: A Suite for Vision-Language Process Reward Modeling [25.565912785217822]
This paper first benchmarks current vision large language models (VLLMs) as two types of reward models.<n>We introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals.<n>We preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models.
arXiv Detail & Related papers (2025-03-26T06:38:31Z) - VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs)<n>Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.<n>For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z) - Better Process Supervision with Bi-directional Rewarding Signals [41.929678717412266]
We introduce BiRM, a novel process supervision model that evaluates correctness of previous steps and also models the probability of future success.<n>We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps.<n>In search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
arXiv Detail & Related papers (2025-03-06T17:03:17Z) - Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs)
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs)
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.