GPT vs Human for Scientific Reviews: A Dual Source Review on
Applications of ChatGPT in Science
- URL: http://arxiv.org/abs/2312.03769v1
- Date: Tue, 5 Dec 2023 21:41:52 GMT
- Title: GPT vs Human for Scientific Reviews: A Dual Source Review on
Applications of ChatGPT in Science
- Authors: Chenxi Wu, Alan John Varghese, Vivek Oommen, George Em Karniadakis
- Abstract summary: We consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model.
We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer.
In subjective questions, the uninformed evaluators showed varying preferences between SciSpace and human responses.
- Score: 1.8434042562191815
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The new polymath Large Language Models (LLMs) can greatly speed up scientific
reviews, possibly using more unbiased quantitative metrics, facilitating
cross-disciplinary connections, and identifying emerging trends and research
gaps by analyzing large volumes of data. At present, however, they lack the
deep understanding of complex methodologies that reviewing requires, have
difficulty evaluating innovative claims, and are unable to assess ethical
issues and conflicts of interest. Herein, we consider 13 GPT-related
papers across different scientific domains, reviewed by a human reviewer and
SciSpace, a large language model, with the reviews evaluated by three distinct
types of evaluators, namely GPT-3.5, a crowd panel, and GPT-4. We found that
50% of SciSpace's responses to objective questions align with those of a human
reviewer, with GPT-4 (informed evaluator) often rating the human reviewer
higher in accuracy, and SciSpace higher in structure, clarity, and
completeness. In subjective questions, the uninformed evaluators (GPT-3.5 and
crowd panel) showed varying preferences between SciSpace and human responses,
with the crowd panel showing a preference for the human responses. However,
GPT-4 rated them equally in accuracy and structure but favored SciSpace for
completeness.
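At its core, the objective part of this dual-source protocol compares two sets of
answers to the same review questionnaire. The short Python sketch below shows one
way such a 50% alignment figure could be computed; the question names, answer
format, and exact-match rule are illustrative assumptions, not the paper's actual
questionnaire.

    # Sketch: fraction of objective review questions where SciSpace's answer
    # matches the human reviewer's answer (hypothetical data format).
    def alignment_rate(human_answers: dict, llm_answers: dict) -> float:
        shared = set(human_answers) & set(llm_answers)
        if not shared:
            return 0.0
        matches = sum(
            human_answers[q].strip().lower() == llm_answers[q].strip().lower()
            for q in shared
        )
        return matches / len(shared)

    # Hypothetical yes/no answers for one reviewed paper.
    human = {"reports_limitations": "yes", "code_available": "no",
             "novel_method": "yes", "ethics_discussed": "no"}
    scispace = {"reports_limitations": "yes", "code_available": "yes",
                "novel_method": "no", "ethics_discussed": "no"}
    print(f"alignment: {alignment_rate(human, scispace):.0%}")  # 50% on this toy input

Aggregating this rate over all 13 papers, and separately collecting the evaluators'
accuracy, structure, clarity, and completeness ratings, yields the comparisons
summarized above.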
Related papers
- Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
- Automated Focused Feedback Generation for Scientific Writing Assistance [6.559560602099439]
We introduce SWIF$^2$T: a Scientific WrIting Focused Feedback Tool.
It is designed to generate specific, actionable and coherent comments, which identify weaknesses in a scientific paper and/or propose revisions to it.
We compile a dataset of 300 peer reviews citing weaknesses in scientific papers and conduct human evaluation.
The results demonstrate the superiority of SWIF$^2$T's feedback over other approaches in specificity, reading comprehension, and overall helpfulness.
arXiv Detail & Related papers (2024-05-30T20:56:41Z)
- An Empirical Analysis on Large Language Models in Debate Evaluation [10.677407097411768]
We investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation.
We uncover a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented (a counter-balanced judging sketch appears after this list).
We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations, such as numerical or sequential labels.
arXiv Detail & Related papers (2024-05-28T18:34:53Z)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements.
It is capable of processing both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 achieves the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z)
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation [93.55550787058012]
This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models.
To this end, we first develop a prompt generator using GPT-4V to produce evaluation prompts.
We then design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria.
arXiv Detail & Related papers (2024-01-08T18:52:09Z)
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
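Several of the papers above use an LLM as a pairwise judge and report a position bias
towards whichever response is shown second. A minimal, hedged sketch of one common
mitigation, judging each pair twice with the order swapped and discarding verdicts
that flip with position, is shown below; the judge callable is a placeholder for any
LLM call returning "A" or "B", not an API taken from these papers.

    # Sketch: counter-balanced pairwise judging. The same pair is scored twice
    # with the presentation order swapped, so a judge that always prefers the
    # second candidate produces an inconsistent (discarded) verdict instead of
    # a silently biased one.
    from typing import Callable

    def debiased_verdict(resp_1: str, resp_2: str,
                         judge: Callable[[str, str], str]) -> str:
        first = judge(resp_1, resp_2)   # resp_1 shown as candidate A
        second = judge(resp_2, resp_1)  # order swapped: resp_2 shown as candidate A
        if first == "A" and second == "B":
            return "resp_1"             # preferred in both orders
        if first == "B" and second == "A":
            return "resp_2"
        return "tie/inconsistent"       # verdict flipped with position

    # Toy judge that always picks the second candidate, mimicking the reported bias.
    biased_judge = lambda a, b: "B"
    print(debiased_verdict("human review", "LLM review", biased_judge))  # tie/inconsistent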
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.