Training Data Efficiency in Multimodal Process Reward Models
- URL: http://arxiv.org/abs/2602.04145v2
- Date: Thu, 05 Feb 2026 03:46:39 GMT
- Title: Training Data Efficiency in Multimodal Process Reward Models
- Authors: Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu, Wenxuan Zhang, Jiaxin Huang
- Abstract summary: Training MPRMs requires large-scale Monte Carlo (MC)-annotated corpora. This paper studies the data efficiency for MPRM training. We propose the Balanced-Information Score (BIS) which prioritizes both mixture and reliability based on existing MC signals.
- Score: 33.13249650453014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
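The abstract names the two ingredients of BIS (label mixture of positive/negative steps and the average MC score of positive steps) but does not give the formula. The sketch below is only a toy illustration of rollout-level selection built from those two ingredients. The product form in `toy_score`, the `threshold` rule for calling a step positive, and all function names are assumptions made here for illustration, not the paper's definition of BIS.

```python
from statistics import mean


def mixture(labels):
    """Label mixture p * (1 - p), where p is the fraction of positive steps.

    Peaks at 0.25 for a balanced rollout, and is 0 when all steps share a label.
    """
    p = sum(labels) / len(labels)
    return p * (1.0 - p)


def reliability(mc_scores, threshold=0.0):
    """Average MC score over positive steps (0.0 if there are none).

    ASSUMPTION: a step counts as positive when its MC score exceeds `threshold`.
    """
    positives = [s for s in mc_scores if s > threshold]
    return mean(positives) if positives else 0.0


def toy_score(mc_scores, threshold=0.0):
    """Hypothetical rollout score: mixture times reliability.

    ASSUMPTION: this product is a stand-in, not the published BIS formula.
    """
    labels = [1 if s > threshold else 0 for s in mc_scores]
    return mixture(labels) * reliability(mc_scores, threshold)


def select_top_fraction(rollouts, fraction):
    """Keep the highest-scoring `fraction` of rollouts (at least one)."""
    k = max(1, round(len(rollouts) * fraction))
    return sorted(rollouts, key=toy_score, reverse=True)[:k]


if __name__ == "__main__":
    # Each rollout is a list of per-step MC scores in [0, 1] (made-up values).
    rollouts = [
        [1.0, 1.0, 1.0],       # all positive: no label mixture
        [0.9, 0.0, 0.8, 0.0],  # balanced labels, reliable positives
        [0.1, 0.0, 0.1, 0.0],  # balanced labels, unreliable positives
    ]
    print(select_top_fraction(rollouts, 1 / 3))
```

Because the score reuses MC values already stored with each rollout, ranking needs no extra model calls, which matches the abstract's "without incurring any additional cost" framing. A multiplicative score zeroes out single-label rollouts entirely; whether the actual method does so is not stated in the abstract.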
Related papers
- SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning [76.61439010634872]
Process reward models (PRMs) facilitate deeper reasoning processes in large language models (LLMs). PRMs are challenging to develop due to the high cost and limited scalability of human-annotated data. We propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework.
arXiv Detail & Related papers (2025-09-20T06:19:55Z) - DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training [28.02129783121819]
DreamPRM-1.5 is an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. It attains 84.6 accuracy on the MMMU validation set, 31.3 accuracy on R-Bench-V and, when paired with a leading backbone, achieves first-place results on public multimodal reasoning leaderboards.
arXiv Detail & Related papers (2025-09-05T23:42:01Z) - Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z) - VRPRM: Process Reward Modeling via Visual Reasoning [25.04579441819971]
We propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K.
arXiv Detail & Related papers (2025-08-05T15:25:24Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [75.72672339168092]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP). On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - MLMC-based Resource Adequacy Assessment with Active Learning Trained Surrogate Models [6.430258446597413]
Multilevel Monte Carlo (MLMC) is a flexible and effective variance reduction technique for accelerating reliability assessments. Data-driven surrogate models have been proposed as lower-level models in complex power system framework.
arXiv Detail & Related papers (2025-05-27T09:21:02Z) - Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment [35.80989342492335]
Noisy preferences in human feedback can lead to reward misgeneralization. This paper aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. We propose an online Collaborative Reward Modeling framework to achieve robust preference learning.
arXiv Detail & Related papers (2025-05-15T10:58:20Z) - Balancing Multimodal Training Through Game-Theoretic Regularization [26.900302082724295]
Multimodal learning holds promise for richer information extraction by capturing dependencies across data sources. Yet, current training methods often underperform due to modality competition. This paper proposes the Multimodal Competition Regularizer (MCR), inspired by a mutual information (MI) decomposition.
arXiv Detail & Related papers (2024-11-11T19:53:05Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs).
MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)