ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews
- URL: http://arxiv.org/abs/2503.08506v1
- Date: Tue, 11 Mar 2025 14:56:58 GMT
- Title: ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews
- Authors: Xian Gao, Jiacheng Ruan, Jingsheng Gao, Ting Liu, Yuzhuo Fu,
- Abstract summary: Academic paper review is a critical yet time-consuming task within the research community.<n>With the increasing volume of academic publications, automating the review process has become a significant challenge.<n>We propose ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews.
- Score: 26.031039064337907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers' judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers-summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.
Related papers
- Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025 [115.86204862475864]
Review Feedback Agent provides automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers.
It was implemented at ICLR 2025 as a large randomized control study.
27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers.
arXiv Detail & Related papers (2025-04-13T22:01:25Z) - Automatically Evaluating the Paper Reviewing Capability of Large Language Models [46.0003776499898]
Large Language Models (LLMs) show potential for providing assistance, but research has reported significant limitations in the reviews they generate.
We developed an automatic evaluation pipeline to assess the LLMs' paper review capability by comparing them with expert-generated reviews.
arXiv Detail & Related papers (2025-02-24T12:05:27Z) - Generative Adversarial Reviews: When LLMs Become the Critic [1.2430809884830318]
We introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers.<n>Central to this approach is a graph-based representation of manuscripts, condensing content and logically organizing information.<n>Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes.
arXiv Detail & Related papers (2024-12-09T06:58:17Z) - Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review.<n>The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.<n>We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z) - AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons.
We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs.
We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assist NLP Researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Reviewer2: Optimizing Review Generation Through Prompt Generation [28.050468098801872]
We propose an efficient two-stage review generation framework called Reviewer2.<n>Unlike prior work, this approach explicitly models the distribution of possible aspects that the review may address.<n>We generate a large-scale review dataset of 27k papers and 99k reviews that we annotate with aspect prompts.
arXiv Detail & Related papers (2024-02-16T18:43:10Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.