CriticAL: Critic Automation with Language Models
- URL: http://arxiv.org/abs/2411.06590v1
- Date: Sun, 10 Nov 2024 20:41:35 GMT
- Title: CriticAL: Critic Automation with Language Models
- Authors: Michael Y. Li, Vivek Vajipey, Noah D. Goodman, Emily B. Fox
- Abstract summary: CriticAL generates summary statistics that capture discrepancies between model predictions and data.
CriticAL reliably generates correct critiques without hallucinating incorrect ones.
- Score: 31.1575961776287
- Abstract: Understanding the world through models is a fundamental goal of scientific research. While large language model (LLM) based approaches show promise in automating scientific discovery, they often overlook the importance of criticizing scientific models. Criticizing models deepens scientific understanding and drives the development of more accurate models. Automating model criticism is difficult because it traditionally requires a human expert to define how to compare a model with data and evaluate if the discrepancies are significant--both rely heavily on understanding the modeling assumptions and domain. Although LLM-based critic approaches are appealing, they introduce new challenges: LLMs might hallucinate the critiques themselves. Motivated by this, we introduce CriticAL (Critic Automation with Language Models). CriticAL uses LLMs to generate summary statistics that capture discrepancies between model predictions and data, and applies hypothesis tests to evaluate their significance. We can view CriticAL as a verifier that validates models and their critiques by embedding them in a hypothesis testing framework. In experiments, we evaluate CriticAL across key quantitative and qualitative dimensions. In settings where we synthesize discrepancies between models and datasets, CriticAL reliably generates correct critiques without hallucinating incorrect ones. We show that both human and LLM judges consistently prefer CriticAL's critiques over alternative approaches in terms of transparency and actionability. Finally, we show that CriticAL's critiques enable an LLM scientist to improve upon human-designed models on real-world datasets.
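The abstract's core mechanism, a summary statistic that captures a model-data discrepancy plus a hypothesis test of its significance, can be illustrated with a small simulation-based check. This is a minimal sketch under stated assumptions, not the authors' implementation: the statistic is hand-written (in CriticAL an LLM would propose it), the "model" is a hypothetical unit-variance Gaussian simulator, and the names `criticize`, `empirical_p_value`, and `simulate_from_model` are invented for illustration.

```python
import random
import statistics


def empirical_p_value(observed_stat, simulated_stats):
    """Two-sided empirical p-value from model-simulated statistics.

    Counts how many simulated statistics lie at least as far from the
    simulated mean as the observed statistic does.
    """
    center = statistics.fmean(simulated_stats)
    observed_dev = abs(observed_stat - center)
    extreme = sum(1 for s in simulated_stats if abs(s - center) >= observed_dev)
    # Add-one correction keeps the p-value strictly positive.
    return (extreme + 1) / (len(simulated_stats) + 1)


def criticize(observed, simulate, test_statistic, n_sims=1000, alpha=0.05):
    """Compare a statistic on observed data with its distribution under the model."""
    observed_stat = test_statistic(observed)
    simulated_stats = [test_statistic(simulate()) for _ in range(n_sims)]
    p_value = empirical_p_value(observed_stat, simulated_stats)
    return {
        "statistic": observed_stat,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


rng = random.Random(0)

# Hypothetical synthetic discrepancy: the data have standard deviation 3,
# while the model under criticism assumes standard deviation 1.
observed = [rng.gauss(0.0, 3.0) for _ in range(200)]


def simulate_from_model():
    """Draw one replicate dataset from the (misspecified) model."""
    return [rng.gauss(0.0, 1.0) for _ in range(200)]


# The variance stands in for an LLM-proposed summary statistic.
result = criticize(observed, simulate_from_model, statistics.pvariance)
print(result["significant"], result["p_value"])
```

Because the test, rather than the proposer of the statistic, decides whether a discrepancy counts, a poorly chosen statistic tends to yield a non-significant result instead of an unfounded critique. That is one way to read the abstract's description of CriticAL as a verifier.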
Related papers
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision.
Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by using multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- CriticBench: Benchmarking LLMs for Critique-Correct Reasoning [26.45110574463893]
CriticBench is a benchmark designed to assess Large Language Models' abilities to critique and rectify their reasoning.
We evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning.
arXiv Detail & Related papers (2024-02-22T18:59:02Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Critique Ability of Large Language Models [38.34144195927209]
This study explores the ability of large language models (LLMs) to deliver accurate critiques across various tasks.
We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses.
arXiv Detail & Related papers (2023-10-07T14:12:15Z)
- Self-critiquing models for assisting human evaluators [11.1006983438712]
We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning.
On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed.
Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs.
arXiv Detail & Related papers (2022-06-12T17:40:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.