Critique Ability of Large Language Models
- URL: http://arxiv.org/abs/2310.04815v1
- Date: Sat, 7 Oct 2023 14:12:15 GMT
- Title: Critique Ability of Large Language Models
- Authors: Liangchen Luo, Zi Lin, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang,
Lei Meng
- Abstract summary: This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
- Score: 38.34144195927209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Critical thinking is essential for rational decision-making and
problem-solving. This skill hinges on the ability to provide precise and
reasoned critiques and is a hallmark of human intelligence. In the era of large
language models (LLMs), this study explores the ability of LLMs to deliver
accurate critiques across various tasks. We are interested in this topic as a
capable critic model could not only serve as a reliable evaluator, but also as
a source of supervised signals for model tuning. Particularly, if a model can
self-critique, it has the potential for autonomous self-improvement. To examine
this, we introduce a unified evaluation framework for assessing the critique
abilities of LLMs. We develop a benchmark called CriticBench, which comprises
3K high-quality natural language queries and corresponding model responses,
annotated for correctness. The benchmark covers tasks such as
math problem-solving, code completion, and question answering. We evaluate
multiple LLMs on the collected dataset and our analysis reveals several
noteworthy insights: (1) Critique is generally challenging for most LLMs, and
this capability often emerges only when models are sufficiently large. (2) In
particular, self-critique is especially difficult. Even top-performing LLMs
struggle to achieve satisfactory performance. (3) Models tend to have lower
critique accuracy on problems where they are most uncertain. Building on these
findings, we introduce a simple yet effective baseline named self-check, which
leverages self-critique to improve task performance for various models. We hope this
study serves as an initial exploration into understanding the critique
abilities of LLMs, and aims to inform future research, including the
development of more proficient critic models and the application of critiques
across diverse tasks.
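The abstract describes the self-check baseline only at a high level. The sketch below is a minimal illustration of the general idea rather than the paper's implementation: it assumes a generic text-in/text-out `llm` callable, and the prompts, retry policy, and function names are hypothetical. The model answers a query, critiques its own answer, and regenerates when the self-critique flags the answer as incorrect.

```python
# Illustrative sketch of a self-check loop (assumed interface, not the paper's code).
from typing import Callable


def self_check(llm: Callable[[str], str], query: str, max_retries: int = 2) -> str:
    """Answer `query`, then use self-critique to accept or revise the answer."""
    answer = llm(f"Question: {query}\nAnswer step by step, then state the final answer.")
    for _ in range(max_retries):
        critique = llm(
            "You are reviewing your own answer.\n"
            f"Question: {query}\nAnswer: {answer}\n"
            "Is the answer correct? Reply with 'CORRECT' or 'INCORRECT' and a brief reason."
        )
        if critique.strip().upper().startswith("CORRECT"):
            return answer  # self-critique accepts the answer
        # Otherwise, regenerate the answer conditioned on the critique.
        answer = llm(
            f"Question: {query}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nWrite a corrected answer."
        )
    return answer  # fall back to the latest attempt


if __name__ == "__main__":
    # Toy stand-in for a real model so the sketch runs end to end.
    def toy_llm(prompt: str) -> str:
        return "CORRECT: the arithmetic checks out." if "reviewing" in prompt else "42"

    print(self_check(toy_llm, "What is 6 * 7?"))
```

In the paper's setting, the value of such a loop depends on the model's critique accuracy; as noted above, this accuracy tends to drop on exactly the problems where the model is most uncertain.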
Related papers
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- CriticAL: Critic Automation with Language Models [31.1575961776287]
CriticAL generates summary statistics that capture discrepancies between model predictions and data.
CriticAL reliably generates correct critiques without hallucinating incorrect ones.
arXiv Detail & Related papers (2024-11-10T20:41:35Z)
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of relying on a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
It relies on a CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z)
- CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEval is a novel benchmark designed to comprehensively and reliably evaluate critique ability of Large Language Models.
To ensure comprehensiveness, CriticEval evaluates critique ability along four dimensions across nine diverse task scenarios.
To ensure reliability, a large number of critiques are annotated to serve as references.
arXiv Detail & Related papers (2024-02-21T12:38:59Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.