CAMEL: Confidence-Gated Reflection for Reward Modeling
- URL: http://arxiv.org/abs/2602.20670v1
- Date: Tue, 24 Feb 2026 08:20:08 GMT
- Title: CAMEL: Confidence-Gated Reflection for Reward Modeling
- Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You
- Abstract summary: We propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. We train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy.
- Score: 26.908515245229747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
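As a rough illustration of the gating mechanism described in the abstract, the sketch below uses the log-probability margin between two verdict tokens as a confidence signal and only invokes a reflection pass when that margin is small. The function names, the "A"/"B" verdict tokens, and the threshold value are illustrative assumptions, not the paper's actual interface or settings.

```python
# Minimal sketch of confidence-gated reflection, assuming a hypothetical
# `score_verdict_tokens(prompt)` that returns the log-probabilities the reward
# model assigns to the two single-token verdicts, and a hypothetical
# `reflect_and_revise(prompt, verdict)` that runs the more expensive generative
# reflection pass and returns a possibly revised verdict.

CONFIDENCE_THRESHOLD = 2.0  # illustrative value only; not taken from the paper


def judge(prompt, score_verdict_tokens, reflect_and_revise):
    """Confidence-gated preference judgment in the spirit of CAMEL."""
    logp_a, logp_b = score_verdict_tokens(prompt)

    # Fast path: a single-token decision; the log-probability margin between
    # the verdict tokens serves as a confidence proxy at no extra cost.
    verdict = "A" if logp_a >= logp_b else "B"
    margin = abs(logp_a - logp_b)

    if margin >= CONFIDENCE_THRESHOLD:
        return verdict  # high confidence: skip reflection entirely

    # Slow path: only low-confidence instances pay for reflection,
    # which may overturn the initial verdict.
    return reflect_and_revise(prompt, verdict)
```

The counterfactual prefix augmentation mentioned in the abstract concerns training (exposing the model to diverse initial verdicts so it learns genuine revision) and is orthogonal to this inference-time gate, so it is not shown here.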
Related papers
- Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z) - EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs [9.412828452977553]
Existing approaches reinforce successful reasoning paths, incurring a substantial calibration cost. This failure has been characterized as a form of model collapse in alignment. We propose EpiCaR as a training objective that jointly optimizes reasoning performance and calibration.
arXiv Detail & Related papers (2026-01-11T06:21:13Z) - Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models [63.00458229517523]
This work addresses the evaluation challenge of reward models by probing preference representations. We construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction and enhances its interpretability.
arXiv Detail & Related papers (2025-11-16T05:29:29Z) - Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness [12.513874407270142]
We present a systematic comparison of "thinking" and "non-thinking" Large Language Models (LLMs). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks. Our results show that thinking models achieve approximately 10 percentage points higher accuracy with little overhead.
arXiv Detail & Related papers (2025-09-09T18:36:02Z) - Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future [38.1810626252963]
Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals.
arXiv Detail & Related papers (2025-08-08T05:25:54Z) - Bi-directional Model Cascading with Proxy Confidence [3.1890398692194326]
We propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously. We use an analysis of hidden states to improve post-invocation confidence of the small model. We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model (a minimal sketch of this deferral pattern appears after this list).
arXiv Detail & Related papers (2025-04-27T23:48:14Z) - Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning.
arXiv Detail & Related papers (2024-10-14T01:57:25Z) - Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision-language models.
We show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms of ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E^3), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
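As a rough illustration of the deferral pattern described in the Bi-directional Model Cascading entry above, the sketch below combines a post-invocation confidence check on the small model with a pre-invocation proxy estimate for the large model. All function names and thresholds here are illustrative assumptions, not that paper's interface.

```python
# Minimal sketch of bi-directional confidence-based deferral, assuming
# hypothetical callables: `small_model(x)` and `large_model(x)` return
# (answer, confidence), and `proxy(x)` returns an estimate of how confident
# the large model would be before it is actually invoked.

def cascade_answer(x, small_model, proxy, large_model,
                   small_threshold=0.8, proxy_threshold=0.5):
    # Post-invocation confidence: run the cheap small model first.
    answer, small_conf = small_model(x)
    if small_conf >= small_threshold:
        return answer  # small model is confident enough; no deferral

    # Pre-invocation confidence: the tiny proxy decides whether paying
    # for the large model is likely to help at all.
    if proxy(x) >= proxy_threshold:
        answer, _ = large_model(x)
    return answer
```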
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.