Related papers: CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

URL: http://arxiv.org/abs/2402.14809v4
Date: Sat, 1 Jun 2024 07:46:28 GMT
Title: CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
Authors: Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang,
Abstract summary: CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning. We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
Score: 26.45110574463893
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.

Related papers

DeepCritic: Deliberate Critique with Large Language Models [77.5516314477878]
We focus on studying and enhancing the math critique ability of Large Language Models (LLMs) Our developed critique model built on Qwen2.5-7B-Instruct significantly outperforms existing LLM critics on various error identification benchmarks.
arXiv Detail & Related papers (2025-05-01T17:03:17Z)
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs) Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities. It self-improves by training on synthetic data, generated by a contrastive-based self-critic. It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [112.35483894933904]
We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought. LookBack significantly improves critique and correction performance by up to 13.5%.
arXiv Detail & Related papers (2024-12-03T05:04:49Z)
Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
MultiCritique pipeline improves critique ability of LLMs by utilizing multi-agent feedback. pipeline aggregates high-quality critiques from multiple agents instead of a single model. Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability. CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation. Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEval is a novel benchmark designed to comprehensively and reliably evaluate critique ability of Large Language Models. To ensure the comprehensiveness, CriticEval evaluates critique ability from four dimensions across nine diverse task scenarios. To ensure the reliability, a large number of critiques are annotated to serve as references.
arXiv Detail & Related papers (2024-02-21T12:38:59Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Critique Ability of Large Language Models [38.34144195927209]
This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks. We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
arXiv Detail & Related papers (2023-10-07T14:12:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.