Can large language models provide useful feedback on research papers? A
large-scale empirical analysis
- URL: http://arxiv.org/abs/2310.01783v1
- Date: Tue, 3 Oct 2023 04:14:17 GMT
- Title: Can large language models provide useful feedback on research papers? A
large-scale empirical analysis
- Authors: Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding,
Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel
McFarland, James Zou
- Abstract summary: High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
- Score: 38.905758846360435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expert feedback lays the foundation of rigorous research. However, the rapid
growth of scholarly production and intricate knowledge specialization challenge
the conventional scientific feedback mechanisms. High-quality peer reviews are
increasingly difficult to obtain. Researchers who are more junior or from
under-resourced settings find it especially hard to get timely feedback.
With the breakthrough of large language models (LLMs) such as GPT-4, there is
growing interest in using LLMs to generate scientific feedback on research
manuscripts. However, the utility of LLM-generated feedback has not been
systematically studied. To address this gap, we created an automated pipeline
using GPT-4 to provide comments on the full PDFs of scientific papers. We
evaluated the quality of GPT-4's feedback through two large-scale studies. We
first quantitatively compared GPT-4's generated feedback with human peer
reviewer feedback in 15 Nature family journals (3,096 papers in total) and the
ICLR machine learning conference (1,709 papers). The overlap in the points
raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature
journals, 39.23% for ICLR) is comparable to the overlap between two human
reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The
overlap between GPT-4 and human reviewers is larger for the weaker papers. We
then conducted a prospective user study with 308 researchers from 110 US
institutions in the field of AI and computational biology to understand how
researchers perceive feedback generated by our GPT-4 system on their own
papers. Overall, more than half (57.4%) of the users found the GPT-4-generated
feedback helpful or very helpful, and 82.4% found it more beneficial than feedback
from at least some human reviewers. While our findings show that LLM-generated
feedback can help researchers, we also identify several limitations.
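To make the headline overlap numbers concrete, here is a minimal sketch of how a directional overlap fraction between two reviewers' comment sets could be computed, assuming each review has already been reduced to a set of matched point labels. This is not the paper's actual pipeline: the extraction and matching of points (done with GPT-4 in the paper) is not shown, and the function name and example labels are hypothetical.

```python
# Hypothetical sketch of the overlap arithmetic only; the paper's pipeline
# extracts and matches review points with GPT-4, which is not reproduced here.

def overlap_fraction(points_a: set[str], points_b: set[str]) -> float:
    """Fraction of the points in points_a that are also raised in points_b."""
    if not points_a:
        return 0.0
    return len(points_a & points_b) / len(points_a)

# Made-up point labels for illustration.
gpt4_points = {"missing ablation", "limited baselines", "unclear notation"}
human_points = {"missing ablation", "unclear notation", "small sample size"}

print(f"GPT-4 -> human overlap: {overlap_fraction(gpt4_points, human_points):.1%}")
print(f"human -> GPT-4 overlap: {overlap_fraction(human_points, gpt4_points):.1%}")
```

As written the measure is directional (A-to-B generally differs from B-to-A); the sketch illustrates only the arithmetic, not how the paper matches points or aggregates across papers.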
Related papers
- OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs [151.79792315631965]
We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses.
On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model.
OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs.
arXiv Detail & Related papers (2024-11-21T15:07:42Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs [41.64918533152914]
We investigate whether large language models (LLMs) are capable of generating references based on two forms of sentence queries.
From around 20K research articles, we draw several conclusions about public and proprietary LLMs.
Our study contributes valuable insights into the reliability of RAG for automated citation generation tasks.
arXiv Detail & Related papers (2024-05-03T16:38:51Z)
- Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
- MARG: Multi-Agent Review Generation for Scientific Papers [28.78019426139167]
We develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion.
By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM.
In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time.
Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).
arXiv Detail & Related papers (2024-01-08T22:24:17Z)
- GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science [1.8434042562191815]
We consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model.
We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer.
For subjective questions, the uninformed evaluators showed varying preferences between SciSpace and human responses.
arXiv Detail & Related papers (2023-12-05T21:41:52Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- GPT4 is Slightly Helpful for Peer-Review Assistance: A Pilot Study [0.0]
This pilot study examines whether GPT-4 can assist in the peer-review process.
By comparing reviews generated by both human reviewers and GPT models for academic papers submitted to a major machine learning conference, we provide initial evidence that artificial intelligence can contribute effectively to the peer-review process.
arXiv Detail & Related papers (2023-06-16T23:11:06Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.