Can large language models provide useful feedback on research papers? A
large-scale empirical analysis
- URL: http://arxiv.org/abs/2310.01783v1
- Date: Tue, 3 Oct 2023 04:14:17 GMT
- Title: Can large language models provide useful feedback on research papers? A
large-scale empirical analysis
- Authors: Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding,
Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel
McFarland, James Zou
- Abstract summary: High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
- Score: 38.905758846360435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expert feedback lays the foundation of rigorous research. However, the rapid
growth of scholarly production and intricate knowledge specialization challenge
the conventional scientific feedback mechanisms. High-quality peer reviews are
increasingly difficult to obtain. Researchers who are more junior or from
under-resourced settings find it especially hard to get timely feedback.
With the breakthrough of large language models (LLMs) such as GPT-4, there is
growing interest in using LLMs to generate scientific feedback on research
manuscripts. However, the utility of LLM-generated feedback has not been
systematically studied. To address this gap, we created an automated pipeline
using GPT-4 to provide comments on the full PDFs of scientific papers. We
evaluated the quality of GPT-4's feedback through two large-scale studies. We
first quantitatively compared GPT-4's generated feedback with human peer
reviewer feedback in 15 Nature family journals (3,096 papers in total) and the
ICLR machine learning conference (1,709 papers). The overlap in the points
raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature
journals, 39.23% for ICLR) is comparable to the overlap between two human
reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The
overlap between GPT-4 and human reviewers is larger for the weaker papers. We
then conducted a prospective user study with 308 researchers from 110 US
institutions in the field of AI and computational biology to understand how
researchers perceive feedback generated by our GPT-4 system on their own
papers. Overall, more than half (57.4%) of the users found the GPT-4-generated
feedback helpful or very helpful, and 82.4% found it more beneficial than feedback
from at least some human reviewers. While our findings show that LLM-generated
feedback can help researchers, we also identify several limitations.
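To make the headline overlap numbers concrete, here is a minimal sketch of how a directional overlap fraction between two reviewers' comment sets could be computed, assuming each review has already been reduced to a set of matched point labels. This is not the paper's actual pipeline: the extraction and matching of points (done with GPT-4 in the paper) is not shown, and the function name and example labels are hypothetical.

```python
# Hypothetical sketch of the overlap arithmetic only; the paper's pipeline
# extracts and matches review points with GPT-4, which is not reproduced here.

def overlap_fraction(points_a: set[str], points_b: set[str]) -> float:
    """Fraction of the points in points_a that are also raised in points_b."""
    if not points_a:
        return 0.0
    return len(points_a & points_b) / len(points_a)

# Made-up point labels for illustration.
gpt4_points = {"missing ablation", "limited baselines", "unclear notation"}
human_points = {"missing ablation", "unclear notation", "small sample size"}

print(f"GPT-4 -> human overlap: {overlap_fraction(gpt4_points, human_points):.1%}")
print(f"human -> GPT-4 overlap: {overlap_fraction(human_points, gpt4_points):.1%}")
```

As written the measure is directional (A-to-B generally differs from B-to-A); the sketch illustrates only the arithmetic, not how the paper matches points or aggregates across papers.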
Related papers
- OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs [151.79792315631965]
We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses.
On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model.
OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs.
arXiv Detail & Related papers (2024-11-21T15:07:42Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs [41.64918533152914]
We investigate whether large language models (LLMs) are capable of generating references based on two forms of sentence queries.
From around 20K research articles, we draw several conclusions about public and proprietary LLMs.
Our study contributes valuable insights into the reliability of RAG for automated citation generation tasks.
arXiv Detail & Related papers (2024-05-03T16:38:51Z)
- Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
- MARG: Multi-Agent Review Generation for Scientific Papers [28.78019426139167]
We develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion.
By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM.
In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time.
Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).
arXiv Detail & Related papers (2024-01-08T22:24:17Z)
- GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science [1.8434042562191815]
We consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model.
We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer.
For subjective questions, the uninformed evaluators showed varying preferences between SciSpace and human responses.
arXiv Detail & Related papers (2023-12-05T21:41:52Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- GPT4 is Slightly Helpful for Peer-Review Assistance: A Pilot Study [0.0]
This pilot study examines whether GPT-4 can assist in the peer-review process.
By comparing reviews generated by both human reviewers and GPT models for academic papers submitted to a major machine learning conference, we provide initial evidence that artificial intelligence can contribute effectively to the peer-review process.
arXiv Detail & Related papers (2023-06-16T23:11:06Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.