Critique Ability of Large Language Models
- URL: http://arxiv.org/abs/2310.04815v1
- Date: Sat, 7 Oct 2023 14:12:15 GMT
- Title: Critique Ability of Large Language Models
- Authors: Liangchen Luo, Zi Lin, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang,
Lei Meng
- Abstract summary: This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
- Score: 38.34144195927209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Critical thinking is essential for rational decision-making and
problem-solving. This skill hinges on the ability to provide precise and
reasoned critiques and is a hallmark of human intelligence. In the era of large
language models (LLMs), this study explores the ability of LLMs to deliver
accurate critiques across various tasks. We are interested in this topic as a
capable critic model could not only serve as a reliable evaluator, but also as
a source of supervised signals for model tuning. Particularly, if a model can
self-critique, it has the potential for autonomous self-improvement. To examine
this, we introduce a unified evaluation framework for assessing the critique
abilities of LLMs. We develop a benchmark called CriticBench, which comprises
3K high-quality natural language queries and corresponding model responses, and we
annotate the correctness of these responses. The benchmark covers tasks such as
math problem-solving, code completion, and question answering. We evaluate
multiple LLMs on the collected dataset and our analysis reveals several
noteworthy insights: (1) Critique is generally challenging for most LLMs, and
this capability often emerges only when models are sufficiently large. (2) In
particular, self-critique is especially difficult. Even top-performing LLMs
struggle to achieve satisfactory performance. (3) Models tend to have lower
critique accuracy on problems where they are most uncertain. To this end, we
introduce a simple yet effective baseline named self-check, which leverages
self-critique to improve task performance for various models. We hope this
study serves as an initial exploration into understanding the critique
abilities of LLMs, and aims to inform future research, including the
development of more proficient critic models and the application of critiques
across diverse tasks.
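The self-check baseline described above can be read as a sample-then-filter loop: the model proposes several candidate answers, critiques each of its own candidates, and keeps only those it judges correct before picking a final answer. The following is a minimal sketch under that reading, with hypothetical generate and critique wrappers standing in for the model calls; it is not the paper's exact prompting setup.

```python
# Minimal sketch of a self-check loop (assumed reading of the baseline).
# `generate` and `critique` are hypothetical wrappers around the same LLM;
# the paper's actual prompts and selection rule may differ.
from collections import Counter
from typing import Callable, List


def self_check(
    query: str,
    generate: Callable[[str], str],        # samples one candidate answer
    critique: Callable[[str, str], bool],  # model judges its own answer as correct or not
    num_samples: int = 8,
) -> str:
    """Sample candidates, keep those the model itself accepts, return the most common one."""
    candidates: List[str] = [generate(query) for _ in range(num_samples)]
    accepted = [ans for ans in candidates if critique(query, ans)]
    # Fall back to all candidates if the model rejects everything.
    pool = accepted if accepted else candidates
    answer, _ = Counter(pool).most_common(1)[0]
    return answer
```

In practice, generate and critique would call the same underlying model, so the routine relies only on self-critique, in line with the abstract's claim that self-critique can be leveraged to improve task performance across various models.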
Related papers
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of relying on a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z) - Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
It combines a CoT reasoning paradigm with the automatic construction of distant-supervision data without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z) - CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z) - CriticEval: Evaluating Large Language Model as Critic [110.29766259843453]
CriticEval is a novel benchmark designed to comprehensively and reliably evaluate the critique ability of Large Language Models.
To ensure comprehensiveness, CriticEval evaluates critique ability from four dimensions across nine diverse task scenarios.
To ensure reliability, a large number of critiques are annotated to serve as references.
arXiv Detail & Related papers (2024-02-21T12:38:59Z) - The Critique of Critique [45.40025444461465]
We pioneer the critique of critique, termed MetaCritique, which builds specific quantification criteria.
We construct a meta-evaluation dataset covering 4 tasks involving human-written and LLM-generated critiques.
Experiments demonstrate that MetaCritique can achieve near-human performance.
arXiv Detail & Related papers (2024-01-09T12:20:41Z) - Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities.
Concerns persist regarding the accuracy and appropriateness of their generated content.
A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - Self-critiquing models for assisting human evaluators [11.1006983438712]
We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning.
On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed.
Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs.
arXiv Detail & Related papers (2022-06-12T17:40:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.