CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria
- URL: http://arxiv.org/abs/2601.20327v2
- Date: Sat, 31 Jan 2026 12:28:48 GMT
- Title: CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria
- Authors: Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan
- Abstract summary: We propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method. Using only about 5.7K high-quality examples curated from open-source preference data, our CE-RM-4B achieves superior performance on diverse reward model benchmarks.
- Score: 48.70940362676624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and equipped with unified query-based criteria. Using only about 5.7K high-quality examples curated from open-source preference data, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
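To make the pointwise setup concrete, the following is a minimal sketch of query-based criteria and Best-of-N selection as the abstract describes them at a high level. Everything here is a hypothetical illustration (the prompts, the `generate` helper, the "Score:" output convention), not the authors' implementation.

```python
# Minimal sketch of pointwise generative reward scoring with unified
# query-based criteria. `generate` stands in for any instruction-tuned
# LLM call; prompts and the score format are assumptions.

def generate(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned LLM."""
    raise NotImplementedError

def derive_criteria(query: str) -> str:
    # Unified criteria are derived once from the query itself and then
    # reused for every candidate response.
    return generate(
        "List the key criteria a good answer to the following query "
        f"must satisfy.\nQuery: {query}"
    )

def score_response(query: str, response: str, criteria: str) -> float:
    # Pointwise judging: each response is scored independently, so the
    # scores stay comparable across any number of candidates.
    judgment = generate(
        f"Criteria:\n{criteria}\n\nQuery: {query}\nResponse: {response}\n"
        "Rate the response from 1 to 10 against the criteria. "
        "End with 'Score: <number>'."
    )
    return float(judgment.rsplit("Score:", 1)[-1].strip())

def best_of_n(query: str, candidates: list[str]) -> str:
    criteria = derive_criteria(query)
    return max(candidates, key=lambda r: score_response(query, r, criteria))
```

One practical consequence of the pointwise formulation: Best-of-N costs N judge calls plus one criteria call, whereas exhaustive pairwise comparison grows quadratically in N.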
Related papers
- EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization [48.27039405295434]
EffiEval is a training-free approach for efficient benchmarking that addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, fairness, and generalizability. EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data.
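As a rough illustration of the selection idea just described, here is a hedged sketch of greedy capability-coverage maximization. The per-item capability tags and the greedy rule are assumptions for illustration, not EffiEval's exact procedure.

```python
# Hypothetical greedy subset selection: repeatedly pick the benchmark item
# that covers the most capabilities not yet covered, stopping at the
# budget or when nothing new can be covered.

def select_subset(items: list[dict], budget: int) -> list[dict]:
    """items: [{'id': ..., 'capabilities': set of tags}, ...] (assumed schema)."""
    covered: set[str] = set()
    chosen: list[dict] = []
    pool = list(items)
    for _ in range(budget):
        best = max(pool, key=lambda it: len(it["capabilities"] - covered),
                   default=None)
        if best is None or not (best["capabilities"] - covered):
            break  # no remaining item adds new coverage
        chosen.append(best)
        covered |= best["capabilities"]
        pool.remove(best)
    return chosen
```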
arXiv Detail & Related papers (2025-08-13T09:48:23Z)
- User-centric Subjective Leaderboard by Customizable Reward Modeling [34.40455169451943]
We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
arXiv Detail & Related papers (2025-08-13T03:39:04Z)
- EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation [17.37840331449749]
We propose a Self-Evolving Pairwise Reasoning (EvolvR) framework for story evaluation. The framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. The evaluator trained on the refined data is deployed as a reward model to guide the story generation task.
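A hedged sketch of the self-synthesis step just described: several judge personas each produce a chain-of-thought rating, and only CoTs whose score agrees with the gold human score are kept as training data. The personas, prompt, and `llm` helper are invented placeholders.

```python
# Illustrative multi-persona CoT synthesis with score-alignment filtering.

PERSONAS = ["a strict literary critic", "an average reader", "a story editor"]

def synthesize_cot(llm, story: str, gold_score: int) -> list[str]:
    kept = []
    for persona in PERSONAS:
        cot = llm(
            f"You are {persona}. Reason step by step about this story's "
            f"quality, then end with 'Score: <1-5>'.\n\n{story}"
        )
        predicted = int(cot.rsplit("Score:", 1)[-1].strip())
        if predicted == gold_score:  # keep only score-aligned rationales
            kept.append(cot)
    return kept
```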
arXiv Detail & Related papers (2025-08-08T06:10:47Z)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation [11.474187778340012]
Reinforcement learning has significantly advanced code generation for large language models. Current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. We introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL.
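The summary above does not spell out how the process reward enters the objective; one plausible (and here purely assumed) shaping is to credit reasoning quality only when the code actually passes its tests, so the policy cannot trade correctness for well-scored rationales.

```python
# Hedged reward-shaping sketch: outcome reward from test cases, with a
# process-quality bonus gated on test success. This is an assumption
# about the general idea, not Posterior-GRPO's exact formulation.

def combined_reward(passes_tests: bool, reasoning_quality: float,
                    alpha: float = 0.2) -> float:
    """reasoning_quality in [0, 1], e.g., from a process reward model."""
    outcome = 1.0 if passes_tests else 0.0
    bonus = alpha * reasoning_quality if passes_tests else 0.0
    return outcome + bonus
```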
arXiv Detail & Related papers (2025-08-07T09:04:10Z)
- RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
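A minimal sketch of what a reward-guided speculative loop could look like: a small draft model proposes a chunk, a reward model scores it, and low-reward chunks fall back to the large target model. All model interfaces are hypothetical placeholders, and the fixed acceptance threshold is a simplification of RSD's controlled bias.

```python
# Illustrative reward-guided speculative decoding loop (not RSD's code).

def rsd_generate(prompt: str, draft_model, target_model, reward_model,
                 threshold: float = 0.5, max_steps: int = 64) -> str:
    text = prompt
    for _ in range(max_steps):
        chunk = draft_model.propose(text)        # cheap draft proposal
        if reward_model.score(text, chunk) >= threshold:
            text += chunk                        # accept high-reward draft
        else:
            text += target_model.generate(text)  # regenerate with target
        if text.endswith("<eos>"):
            break
    return text
```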
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking. We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem. We establish new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on RewardBench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning.
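As a toy illustration of the graph step just described (not PGED's actual algorithm): pool pairwise judgments from several weak evaluators, keep each edge only in its majority direction, and break remaining cycles by dropping the least-supported edge until the graph is acyclic. `networkx` is assumed available here.

```python
# Illustrative preference-graph ensembling and denoising.
from collections import Counter
import networkx as nx  # assumed dependency

def build_preference_dag(judgments: list[tuple[str, str]]) -> nx.DiGraph:
    """judgments: (winner, loser) pairs pooled from multiple evaluators."""
    votes = Counter(judgments)
    g = nx.DiGraph()
    for (winner, loser), n in votes.items():
        # Ensemble: keep an edge only in the majority direction.
        if n > votes[(loser, winner)]:
            g.add_edge(winner, loser, weight=n - votes[(loser, winner)])
    # Denoise: remove the weakest edge of each cycle until acyclic.
    while not nx.is_directed_acyclic_graph(g):
        cycle = nx.find_cycle(g)
        weakest = min(cycle, key=lambda e: g.edges[e]["weight"])
        g.remove_edge(*weakest)
    return g  # a topological sort of g yields a non-contradictory ranking
```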
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability levels of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.