The Critique of Critique
- URL: http://arxiv.org/abs/2401.04518v2
- Date: Sat, 1 Jun 2024 17:52:14 GMT
- Title: The Critique of Critique
- Authors: Shichao Sun, Junlong Li, Weizhe Yuan, Ruifeng Yuan, Wenjie Li, Pengfei Liu,
- Abstract summary: We pioneer the critique of critique, termed MetaCritique, which builds specific quantification criteria.
We construct a meta-evaluation dataset covering 4 tasks involving human-written and LLM-generated critiques.
Experiments demonstrate that MetaCritique can achieve near-human performance.
- Score: 45.40025444461465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Critique, as a natural language description for assessing the quality of model-generated content, has played a vital role in the training, evaluation, and refinement of LLMs. However, a systematic method to evaluate the quality of critique is lacking. In this paper, we pioneer the critique of critique, termed MetaCritique, which builds specific quantification criteria. To achieve a reliable evaluation outcome, we propose Atomic Information Units (AIUs), which describe the critique in a more fine-grained manner. MetaCritique aggregates each AIU's judgment for the overall score. Moreover, MetaCritique delivers a natural language rationale for the intricate reasoning within each judgment. Lastly, we construct a meta-evaluation dataset covering 4 tasks across 16 public datasets involving human-written and LLM-generated critiques. Experiments demonstrate that MetaCritique can achieve near-human performance. Our study can facilitate future research in LLM critiques based on our following observations and released resources: (1) superior critiques judged by MetaCritique can lead to better refinements, indicating that it can potentially enhance the alignment of existing LLMs; (2) the leaderboard of critique models reveals that open-source critique models commonly suffer from factuality issues; (3) relevant code and data are publicly available at https://github.com/GAIR-NLP/MetaCritique to support deeper exploration; (4) an API at PyPI with the usage documentation in Appendix C allows users to assess the critique conveniently.
Related papers
- Large Language Models as Evaluators for Recommendation Explanations [23.938202791437337]
We investigate whether LLMs can serve as evaluators of recommendation explanations.
We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users.
Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts.
arXiv Detail & Related papers (2024-06-05T13:23:23Z) - Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements.
It is capable of processing both direct assessment and pairwise ranking formats grouped with a user-defined evaluation criteria.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z) - CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z) - CriticBench: Evaluating Large Language Models as Critic [115.8286183749499]
CriticBench is a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of Large Language Models (LLMs)
CriticBench encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity.
Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between the critique ability and tasks, response qualities, and model scales.
arXiv Detail & Related papers (2024-02-21T12:38:59Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Critique Ability of Large Language Models [38.34144195927209]
This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
arXiv Detail & Related papers (2023-10-07T14:12:15Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.