Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
- URL: http://arxiv.org/abs/2510.15514v2
- Date: Tue, 21 Oct 2025 03:35:55 GMT
- Title: Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
- Authors: Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao,
- Abstract summary: This work introduces an end to end framework to detect and resolve inconsistencies within the reinforcement learning training loop.<n>Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework.<n>Experiments confirm that our framework significantly improves training stability and model performance over strong baselines.
- Score: 46.661195064495
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence particularly preference cycles has been largely unaddressed. To address this gap, this work introduces an end to end framework to systematically detect and resolve these inconsistencies within the reinforcement learning training loop. Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that eliminates cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback. The code for our method is available at https://github.com/modelscope/RM-Gallery.
Related papers
- R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging [69.96389360650072]
We show that reasoning fidelity is highly predictive of downstream RLHF outcomes, beyond standard label accuracy.<n>We propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment.
arXiv Detail & Related papers (2026-02-06T15:17:11Z) - ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text.<n>We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline.<n>Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation.<n>CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z) - No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning [21.237273221334963]
ECHO is a framework that jointly optimize the policy and critic through a synchronized co-evolutionary loop.<n>ECHO yields more stable training and higher long-horizon task success across open-world environments.
arXiv Detail & Related papers (2026-01-11T07:29:08Z) - Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment [5.900494456937422]
Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences.<n>This paper investigates a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration.
arXiv Detail & Related papers (2025-12-10T00:52:21Z) - Resolving Conflicts in Lifelong Learning via Aligning Updates in Subspaces [12.630494786258842]
Low-Rank Adaptation (LoRA) enables efficient Continual Learning but often suffers from catastrophic forgetting.<n>We propose PS-LoRA, a framework designed to resolve conflicts by aligning updates within the optimization subspace.<n>Our approach employs a dual-regularization objective that penalizes conflicting directions and constrains magnitude deviations to ensure consistency with prior knowledge.
arXiv Detail & Related papers (2025-11-28T15:34:36Z) - Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization [72.30168853571216]
multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning.<n>CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories.
arXiv Detail & Related papers (2025-09-26T04:32:26Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task.<n>Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments.<n>Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Towards Robust Recommendation via Decision Boundary-aware Graph Contrastive Learning [25.514007761856632]
graph contrastive learning (GCL) has received increasing attention in recommender systems due to its effectiveness in reducing bias caused by data sparsity.
We argue that these methods struggle to balance between semantic invariance and view hardness across the dynamic training process.
We propose a novel GCL-based recommendation framework RGCL, which effectively maintains the semantic invariance of contrastive pairs and dynamically adapts as the model capability evolves.
arXiv Detail & Related papers (2024-07-14T13:03:35Z) - HC-Ref: Hierarchical Constrained Refinement for Robust Adversarial
Training of GNNs [7.635985143883581]
Adversarial training, which has been shown to be one of the most effective defense mechanisms against adversarial attacks in computer vision, holds great promise for enhancing the robustness of GNNs.
We propose a hierarchical constraint refinement framework (HC-Ref) that enhances the anti-perturbation capabilities of GNNs and downstream classifiers separately.
arXiv Detail & Related papers (2023-12-08T07:32:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.