GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
- URL: http://arxiv.org/abs/2310.15405v1
- Date: Mon, 23 Oct 2023 23:24:57 GMT
- Title: GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
- Authors: Ting-Yao Hsu, Chieh-Yang Huang, Ryan Rossi, Sungchul Kim, C. Lee Giles, and Ting-Hao K. Huang
- Abstract summary: This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions.
We first constructed SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600 scientific figure captions.
We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption based on its potential to aid reader understanding.
- Score: 22.181665641802468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is growing interest in systems that generate captions for scientific
figures. However, assessing these systems' output poses a significant challenge.
Human evaluation requires academic expertise and is costly, while automatic
evaluation depends on often low-quality author-written captions. This paper
investigates using large language models (LLMs) as a cost-effective,
reference-free method for evaluating figure captions. We first constructed
SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600
scientific figure captions, both original and machine-made, for 600 arXiv
figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption
based on its potential to aid reader understanding, given relevant context such
as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot
evaluator, outperformed all other models and even surpassed assessments made by
Computer Science and Informatics undergraduates, achieving a Kendall
correlation score of 0.401 with Ph.D. students' rankings.
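The evaluation recipe described in the abstract (prompt GPT-4 with a caption plus a figure-mentioning paragraph, collect a 1-6 helpfulness score, then correlate model scores with human judgments via Kendall's tau) can be sketched in a few lines of Python. The snippet below is a minimal illustration, assuming the OpenAI Python SDK (>=1.0) and SciPy; the prompt wording, model identifier, and placeholder scores are assumptions for illustration, not the authors' exact SCICAP-EVAL protocol.

```python
# Minimal sketch of reference-free caption scoring with GPT-4, in the spirit of
# the paper's setup. Prompt wording and field names below are illustrative
# assumptions, not the authors' exact protocol.
from openai import OpenAI           # assumes openai>=1.0 is installed
from scipy.stats import kendalltau  # rank correlation against human judges

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_caption(caption: str, mention_paragraph: str) -> int:
    """Ask GPT-4 to rate a figure caption from 1 (worst) to 6 (best),
    given a paragraph from the paper that mentions the figure as context."""
    prompt = (
        "You will rate how helpful a scientific figure caption is for a reader.\n"
        f"Context (paragraph mentioning the figure):\n{mention_paragraph}\n\n"
        f"Caption:\n{caption}\n\n"
        "On a scale of 1 (not helpful) to 6 (very helpful), how well would this "
        "caption help a reader understand the figure? Answer with a single integer."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return int(resp.choices[0].message.content.strip())


# Compare model scores with human (e.g., Ph.D. student) judgments using
# Kendall's tau, the correlation reported in the paper. Scores are placeholders.
gpt4_scores = [5, 3, 6, 2, 4]
human_scores = [5, 2, 6, 3, 4]
tau, p_value = kendalltau(gpt4_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```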
Related papers
- SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings [28.973082312034343]
This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions.
SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality.
A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing.
arXiv Detail & Related papers (2024-03-26T15:16:14Z)
- A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science [3.124884279860061]
Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science.
A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading.
arXiv Detail & Related papers (2024-03-21T17:09:08Z)
- VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation [39.88401703956412]
VIEScore is a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks.
We evaluate VIEScore on seven prominent conditional image generation tasks and find that VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45.
VIEScore (with open-source MLLMs) is significantly weaker than GPT-4o and GPT-4V in evaluating synthetic images.
arXiv Detail & Related papers (2023-12-22T17:45:19Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z)
- SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning [18.94446071846939]
Figure caption generation helps move model understanding of scientific documents beyond text.
We extend the large-scale SciCap dataset to include mention-paragraphs (paragraphs mentioning figures) and OCR tokens.
Our results indicate that mention-paragraphs serve as additional context knowledge, which significantly boosts the automatic standard image caption evaluation scores.
arXiv Detail & Related papers (2023-06-06T08:16:16Z)
- INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- SciCap: Generating Captions for Scientific Figures [20.696070723932866]
We introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020.
After pre-processing, SCICAP contained more than two million figures extracted from over 290,000 papers.
We established baseline models that caption graph plots, the dominant (19.2%) figure type.
arXiv Detail & Related papers (2021-10-22T07:10:41Z)