CriticBench: Evaluating Large Language Models as Critic
- URL: http://arxiv.org/abs/2402.13764v3
- Date: Fri, 23 Feb 2024 02:44:52 GMT
- Title: CriticBench: Evaluating Large Language Models as Critic
- Authors: Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen,
Xian-ling Mao
- Abstract summary: CriticBench is a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of Large Language Models (LLMs).
CriticBench encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity.
Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between the critique ability and tasks, response qualities, and model scales.
- Score: 115.8286183749499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Critique ability is crucial in the scalable oversight and self-improvement
of Large Language Models (LLMs). While many recent studies explore the critique
ability of LLMs to judge and refine flaws in generations, how to
comprehensively and reliably measure the critique abilities of LLMs is
under-explored. This paper introduces CriticBench, a novel benchmark designed
to comprehensively and reliably evaluate four key critique ability dimensions
of LLMs: feedback, comparison, refinement and meta-feedback. CriticBench
encompasses nine diverse tasks, each assessing the LLMs' ability to critique
responses at varying levels of quality granularity. Our extensive evaluations
of open-source and closed-source LLMs reveal intriguing relationships between
the critique ability and tasks, response qualities, and model scales. Datasets,
resources and evaluation toolkit for CriticBench will be publicly released at
https://github.com/open-compass/CriticBench.
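As a rough illustration of what evaluating the four critique dimensions (feedback, comparison, refinement, meta-feedback) could look like in code, the sketch below averages a per-dimension score over a set of samples. The record fields, prompt templates, and the `query_llm`/`score_fn` helpers are hypothetical stand-ins, not the released open-compass/CriticBench toolkit API; see the repository linked above for the actual implementation.

```python
# Hypothetical sketch of a CriticBench-style evaluation loop.
# Field names, prompt wording, and the query_llm/score_fn callables are
# illustrative assumptions, not the actual open-compass/CriticBench API.
from dataclasses import dataclass
from typing import Callable, Dict, List

DIMENSIONS = ["feedback", "comparison", "refinement", "meta-feedback"]

@dataclass
class CritiqueSample:
    task: str           # one of the nine tasks, e.g. a math reasoning task
    question: str       # original problem given to the response model
    response: str       # candidate response to be critiqued
    quality_label: str  # ground-truth quality level of the response

def build_prompt(dimension: str, sample: CritiqueSample) -> str:
    """Compose a dimension-specific critique prompt (illustrative wording)."""
    templates = {
        "feedback": "Point out the flaws in the response and rate it from 1 to 10.",
        "comparison": "Compare the response against a reference and pick the better one.",
        "refinement": "Rewrite the response so that the identified flaws are fixed.",
        "meta-feedback": "Judge whether the given critique of the response is sound.",
    }
    return f"Question: {sample.question}\nResponse: {sample.response}\n{templates[dimension]}"

def evaluate(samples: List[CritiqueSample],
             query_llm: Callable[[str], str],
             score_fn: Callable[[str, str, CritiqueSample], float]) -> Dict[str, float]:
    """Average a per-dimension score over all samples."""
    results = {d: 0.0 for d in DIMENSIONS}
    for dim in DIMENSIONS:
        for s in samples:
            critique = query_llm(build_prompt(dim, s))
            results[dim] += score_fn(dim, critique, s)
        results[dim] /= max(len(samples), 1)
    return results
```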
Related papers
- An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability [2.8948274245812327]
We study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation.
Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.
arXiv Detail & Related papers (2025-06-16T16:04:43Z)
- CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.18215355266143]
We introduce a holistic code critique benchmark for Large Language Models (LLMs) called CodeCriticBench.
Specifically, our CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties.
Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics.
arXiv Detail & Related papers (2025-02-23T15:36:43Z)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs).
Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
- Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities.
It self-improves by training on synthetic data, generated by a contrastive-based self-critic.
It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [112.35483894933904]
We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs.
VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought.
LookBack significantly improves critique and correction performance by up to 13.5%.
arXiv Detail & Related papers (2024-12-03T05:04:49Z)
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
It relies on a CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations [35.12731651234186]
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities.
We systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations.
Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
arXiv Detail & Related papers (2024-07-04T17:15:37Z)
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z)
- The Critique of Critique [45.40025444461465]
We pioneer the critique of critique, termed MetaCritique, which builds specific quantification criteria.
We construct a meta-evaluation dataset covering 4 tasks involving human-written and LLM-generated critiques.
Experiments demonstrate that MetaCritique can achieve near-human performance.
arXiv Detail & Related papers (2024-01-09T12:20:41Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Critique Ability of Large Language Models [38.34144195927209]
This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
arXiv Detail & Related papers (2023-10-07T14:12:15Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.