CriticBench: Evaluating Large Language Models as Critic
- URL: http://arxiv.org/abs/2402.13764v3
- Date: Fri, 23 Feb 2024 02:44:52 GMT
- Title: CriticBench: Evaluating Large Language Models as Critic
- Authors: Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen,
Xian-ling Mao
- Abstract summary: CriticBench is a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of Large Language Models (LLMs).
CriticBench encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity.
Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between the critique ability and tasks, response qualities, and model scales.
- Score: 115.8286183749499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Critique ability is crucial in the scalable oversight and self-improvement
of Large Language Models (LLMs). While many recent studies explore the critique
ability of LLMs to judge and refine flaws in generations, how to
comprehensively and reliably measure the critique abilities of LLMs is
under-explored. This paper introduces CriticBench, a novel benchmark designed
to comprehensively and reliably evaluate four key critique ability dimensions
of LLMs: feedback, comparison, refinement and meta-feedback. CriticBench
encompasses nine diverse tasks, each assessing the LLMs' ability to critique
responses at varying levels of quality granularity. Our extensive evaluations
of open-source and closed-source LLMs reveal intriguing relationships between
the critique ability and tasks, response qualities, and model scales. Datasets,
resources and evaluation toolkit for CriticBench will be publicly released at
https://github.com/open-compass/CriticBench.
Related papers
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations [35.12731651234186]
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities.
We systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations.
Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
arXiv Detail & Related papers (2024-07-04T17:15:37Z)
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [111.46455901113976]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- The Critique of Critique [45.40025444461465]
We pioneer the critique of critique, termed MetaCritique, which builds specific quantification criteria.
We construct a meta-evaluation dataset covering 4 tasks involving human-written and LLM-generated critiques.
Experiments demonstrate that MetaCritique can achieve near-human performance.
arXiv Detail & Related papers (2024-01-09T12:20:41Z)
- AlignBench: Benchmarking Chinese Alignment of Large Language Models [100.30878214336444]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
Our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations.
We report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
- Critique Ability of Large Language Models [38.34144195927209]
This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
arXiv Detail & Related papers (2023-10-07T14:12:15Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)