Training Language Model to Critique for Better Refinement
- URL: http://arxiv.org/abs/2506.22157v1
- Date: Fri, 27 Jun 2025 12:10:57 GMT
- Title: Training Language Model to Critique for Better Refinement
- Authors: Tianshu Yu, Chao Xiang, Mingchuan Yang, Pei Ke, Bosi Wen, Cunxiang Wang, Jiale Cheng, Li Zhang, Xinyu Mu, Chuxiong Sun, Minlie Huang
- Abstract summary: We introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment.
- Score: 58.73039433159486
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method's effectiveness in enhancing LLM critique-refinement loops.
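The abstract describes the training loop only at a high level and does not give the exact form of the critique utility (CU). The following is a minimal sketch of the described critique-refine-reward cycle, assuming a judge-based score delta as the CU proxy; all function names and interfaces are illustrative, not the paper's implementation.

```python
# Minimal sketch of the critique -> refinement -> reward loop described in the
# abstract. The concrete CU definition, models, and critic update are NOT
# specified by the abstract; everything below is an illustrative assumption.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    prompt: str
    initial_response: str


def critique_utility(
    judge: Callable[[str, str], float],
    prompt: str,
    original: str,
    refined: str,
) -> float:
    """CU proxy: how much the refinement improves a judge score (assumed form)."""
    return judge(prompt, refined) - judge(prompt, original)


def rco_step(
    batch: List[Example],
    critic_generate: Callable[[str, str], str],     # critic: (prompt, response) -> critique
    actor_refine: Callable[[str, str, str], str],   # actor: (prompt, response, critique) -> refined response
    judge: Callable[[str, str], float],             # scores a response for a prompt
    update_critic: Callable[[List[Tuple[str, str, str, float]]], None],  # e.g. a policy-gradient update
) -> None:
    """One training iteration: reward critiques by the refinements they cause."""
    rewarded_critiques = []
    for ex in batch:
        critique = critic_generate(ex.prompt, ex.initial_response)
        refined = actor_refine(ex.prompt, ex.initial_response, critique)
        cu = critique_utility(judge, ex.prompt, ex.initial_response, refined)
        # Critiques that lead to better refinements receive higher reward,
        # so no direct preference judgment over critiques is needed.
        rewarded_critiques.append((ex.prompt, ex.initial_response, critique, cu))
    update_critic(rewarded_critiques)
```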
Related papers
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards.
We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
- Teaching Language Models to Critique via Reinforcement Learning [59.36253627145115]
We show that critics trained with CTRL significantly enhance pass rates and mitigate errors across both base and stronger generator models.
We also show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision.
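A rough sketch of the iterative critique-revision loop mentioned above; the stopping rule and call signatures are assumptions, not CTRL's actual interface.

```python
# Illustrative test-time critique-revision loop: the critic reviews the current
# answer, and the generator revises until the critic is satisfied or the round
# budget runs out. Signatures are hypothetical.

from typing import Callable


def iterative_refine(
    prompt: str,
    draft: str,
    critic: Callable[[str, str], str],          # (prompt, answer) -> critique text
    generator: Callable[[str, str, str], str],  # (prompt, answer, critique) -> revised answer
    is_satisfied: Callable[[str], bool],        # e.g. critique reports no remaining issues
    max_rounds: int = 3,
) -> str:
    answer = draft
    for _ in range(max_rounds):
        critique = critic(prompt, answer)
        if is_satisfied(critique):
            break
        answer = generator(prompt, answer, critique)
    return answer
```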
arXiv Detail & Related papers (2025-02-05T02:18:46Z)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs).
Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
- Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities.
It self-improves by training on synthetic data generated by a contrastive-based self-critic.
It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [112.35483894933904]
We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs.
VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought.
The proposed LookBack strategy significantly improves critique and correction performance by up to 13.5%.
arXiv Detail & Related papers (2024-12-03T05:04:49Z)
- Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks [68.49251303172674]
State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness.
Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness.
We introduce Critic-guided planning with Retrieval-augmentation (CR-Planner), a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning.
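A minimal sketch of what critic-guided selection over reasoning and retrieval actions could look like; the action space and scoring interface below are illustrative assumptions, not CR-Planner's published API.

```python
# Hypothetical critic-guided planning step: propose candidate next actions
# (reasoning steps or retrieval queries), score them with a fine-tuned critic,
# and execute the highest-scoring one.

from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (kind, content), kind in {"reason", "retrieve"}


def plan_step(
    state: str,
    propose: Callable[[str], List[Action]],        # candidate next actions for the current state
    critic_score: Callable[[str, Action], float],  # critic's value estimate for an action
    apply_action: Callable[[str, Action], str],    # executes a reasoning or retrieval action
) -> str:
    candidates = propose(state)
    best = max(candidates, key=lambda a: critic_score(state, a))
    return apply_action(state, best)
```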
arXiv Detail & Related papers (2024-10-02T11:26:02Z)
- Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
It relies on a CoT reasoning paradigm and the automatic construction of distant-supervision data, without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
- Improving Reward Models with Synthetic Critiques [20.180933963110814]
Reward models (RMs) play a critical role in aligning language models through the process of reinforcement learning from human feedback.
We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback.
We demonstrate that high-quality critiques improve the performance and data efficiency of RMs from different pretrained models.
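One plausible way to fold such synthetic critiques into reward-model training data, sketched here with an assumed prompt layout and field names; the actual conditioning scheme used in the paper is not specified in this summary.

```python
# Illustrative preprocessing: pair each preference response with an LLM-written
# critique so the reward model sees an explanation alongside the raw answer.

from typing import Callable, Dict, List


def build_rm_examples(
    pairs: List[Dict[str, str]],               # each: {"prompt", "chosen", "rejected"}
    critique_llm: Callable[[str, str], str],   # (prompt, response) -> critique text
) -> List[Dict[str, str]]:
    examples = []
    for p in pairs:
        examples.append({
            "prompt": p["prompt"],
            # Assumed template: append the synthetic critique after the response.
            "chosen": f'{p["chosen"]}\n\nCritique: {critique_llm(p["prompt"], p["chosen"])}',
            "rejected": f'{p["rejected"]}\n\nCritique: {critique_llm(p["prompt"], p["rejected"])}',
        })
    return examples
```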
arXiv Detail & Related papers (2024-05-31T14:33:07Z)