MARG: Multi-Agent Review Generation for Scientific Papers
- URL: http://arxiv.org/abs/2401.04259v1
- Date: Mon, 8 Jan 2024 22:24:17 GMT
- Title: MARG: Multi-Agent Review Generation for Scientific Papers
- Authors: Mike D'Arcy, Tom Hope, Larry Birnbaum, Doug Downey
- Abstract summary: We develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion.
By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM.
In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time.
Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).
- Score: 28.78019426139167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the ability of LLMs to generate feedback for scientific papers and
develop MARG, a feedback generation approach using multiple LLM instances that
engage in internal discussion. By distributing paper text across agents, MARG
can consume the full text of papers beyond the input length limitations of the
base LLM, and by specializing agents and incorporating sub-tasks tailored to
different comment types (experiments, clarity, impact) it improves the
helpfulness and specificity of feedback. In a user study, baseline methods
using GPT-4 were rated as producing generic or very generic comments more than
half the time, and only 1.7 comments per paper were rated as good overall in
the best baseline. Our system substantially improves the ability of GPT-4 to
generate specific and helpful feedback, reducing the rate of generic comments
from 60% to 29% and generating 3.7 good comments per paper (a 2.2x
improvement).
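As a rough illustration (not the authors' implementation) of the multi-agent setup the abstract describes, the sketch below distributes paper chunks across worker agents and has a leader agent merge their aspect-specific notes into review comments; `call_llm`, the chunk size, and the single leader/worker split are all hypothetical placeholders standing in for any chat-completion client and for the paper's actual agent roles and discussion protocol.

```python
from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call (assumption)."""
    return f"[model response to {len(prompt)} prompt characters]"

def chunk_paper(paper_text: str, chunk_size: int = 8000) -> List[str]:
    """Split the paper so each worker agent fits within the base LLM's context limit."""
    return [paper_text[i:i + chunk_size] for i in range(0, len(paper_text), chunk_size)]

def generate_feedback(paper_text: str, aspect: str) -> str:
    """One specialized pass, e.g. aspect = 'experiments', 'clarity', or 'impact'."""
    chunks = chunk_paper(paper_text)
    # Worker agents: each reads only its own chunk and reports aspect-specific notes.
    worker_notes = [
        call_llm(
            f"You hold part {i + 1} of {len(chunks)} of a paper under review. "
            f"List specific issues related to {aspect}.\n\n{chunk}"
        )
        for i, chunk in enumerate(chunks)
    ]
    # Leader agent: merges the workers' notes into concrete, paper-specific comments.
    return call_llm(
        f"Combine these notes into specific, actionable {aspect} comments:\n\n"
        + "\n\n".join(worker_notes)
    )

if __name__ == "__main__":
    paper = "Full paper text goes here..."
    for aspect in ("experiments", "clarity", "impact"):
        print(generate_feedback(paper, aspect))
```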
Related papers
- Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study [13.650356901064807]
This user study was performed in two organizations, Mozilla and Ubisoft.
We observed that 8.1% and 7.2% of LLM-generated comments were accepted by reviewers at Mozilla and Ubisoft, respectively.
Refactoring-related comments are more likely to be accepted than functional comments.
arXiv Detail & Related papers (2024-11-11T16:12:11Z)
- AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons.
We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs.
We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assisting NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- Reviewer2: Optimizing Review Generation Through Prompt Generation [27.379753994272875]
We propose an efficient two-stage review generation framework called Reviewer2.
Unlike prior work, this approach explicitly models the distribution of possible aspects that the review may address.
We generate a large-scale review dataset of 27k papers and 99k reviews that we annotate with aspect prompts.
arXiv Detail & Related papers (2024-02-16T18:43:10Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference time optimization method to refine LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement (a minimal sketch of this kind of loop appears after this list).
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
arXiv Detail & Related papers (2023-03-30T18:30:01Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
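Relating to the Self-Refine entry above, here is a minimal, hypothetical sketch of an iterative feedback-and-refine loop in that spirit (not the authors' code); `call_llm` and the stopping phrase are assumptions standing in for a real chat-completion client and stopping criterion.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model call; returns a canned reply so the sketch runs end-to-end."""
    return "NO FURTHER ISSUES"

def self_refine(task: str, max_rounds: int = 3) -> str:
    # Initial attempt by the model.
    output = call_llm(f"Complete the following task:\n{task}")
    for _ in range(max_rounds):
        # The same model critiques its own output...
        feedback = call_llm(f"Critique this answer to the task '{task}':\n{output}")
        if "NO FURTHER ISSUES" in feedback:
            break
        # ...and then revises the output using that feedback.
        output = call_llm(
            f"Task: {task}\nPrevious answer: {output}\nFeedback: {feedback}\n"
            "Rewrite the answer, addressing the feedback."
        )
    return output

if __name__ == "__main__":
    print(self_refine("Write a one-paragraph summary of the MARG paper."))
```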
This list is automatically generated from the titles and abstracts of the papers on this site.