MARG: Multi-Agent Review Generation for Scientific Papers
- URL: http://arxiv.org/abs/2401.04259v1
- Date: Mon, 8 Jan 2024 22:24:17 GMT
- Title: MARG: Multi-Agent Review Generation for Scientific Papers
- Authors: Mike D'Arcy, Tom Hope, Larry Birnbaum, Doug Downey
- Abstract summary: We develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion.
By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM.
In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time.
Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).
- Score: 28.78019426139167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the ability of LLMs to generate feedback for scientific papers and
develop MARG, a feedback generation approach using multiple LLM instances that
engage in internal discussion. By distributing paper text across agents, MARG
can consume the full text of papers beyond the input length limitations of the
base LLM, and by specializing agents and incorporating sub-tasks tailored to
different comment types (experiments, clarity, impact) it improves the
helpfulness and specificity of feedback. In a user study, baseline methods
using GPT-4 were rated as producing generic or very generic comments more than
half the time, and only 1.7 comments per paper were rated as good overall in
the best baseline. Our system substantially improves the ability of GPT-4 to
generate specific and helpful feedback, reducing the rate of generic comments
from 60% to 29% and generating 3.7 good comments per paper (a 2.2x
improvement).
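As a rough illustration (not the authors' implementation) of the multi-agent setup the abstract describes, the sketch below distributes paper chunks across worker agents and has a leader agent merge their aspect-specific notes into review comments; `call_llm`, the chunk size, and the single leader/worker split are all hypothetical placeholders standing in for any chat-completion client and for the paper's actual agent roles and discussion protocol.

```python
from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call (assumption)."""
    return f"[model response to {len(prompt)} prompt characters]"

def chunk_paper(paper_text: str, chunk_size: int = 8000) -> List[str]:
    """Split the paper so each worker agent fits within the base LLM's context limit."""
    return [paper_text[i:i + chunk_size] for i in range(0, len(paper_text), chunk_size)]

def generate_feedback(paper_text: str, aspect: str) -> str:
    """One specialized pass, e.g. aspect = 'experiments', 'clarity', or 'impact'."""
    chunks = chunk_paper(paper_text)
    # Worker agents: each reads only its own chunk and reports aspect-specific notes.
    worker_notes = [
        call_llm(
            f"You hold part {i + 1} of {len(chunks)} of a paper under review. "
            f"List specific issues related to {aspect}.\n\n{chunk}"
        )
        for i, chunk in enumerate(chunks)
    ]
    # Leader agent: merges the workers' notes into concrete, paper-specific comments.
    return call_llm(
        f"Combine these notes into specific, actionable {aspect} comments:\n\n"
        + "\n\n".join(worker_notes)
    )

if __name__ == "__main__":
    paper = "Full paper text goes here..."
    for aspect in ("experiments", "clarity", "impact"):
        print(generate_feedback(paper, aspect))
```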
Related papers
- Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study [13.650356901064807]
This user study was performed in two organizations, Mozilla and Ubisoft.
We observed that 8.1% and 7.2% of LLM-generated comments were accepted by reviewers at Mozilla and Ubisoft, respectively.
Refactoring-related comments are more likely to be accepted than functional comments.
arXiv Detail & Related papers (2024-11-11T16:12:11Z)
- AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons.
We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs.
We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assisting NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- Reviewer2: Optimizing Review Generation Through Prompt Generation [27.379753994272875]
We propose an efficient two-stage review generation framework called Reviewer2.
Unlike prior work, this approach explicitly models the distribution of possible aspects that the review may address.
We generate a large-scale review dataset of 27k papers and 99k reviews that we annotate with aspect prompts.
arXiv Detail & Related papers (2024-02-16T18:43:10Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference time optimization method to refine LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement (a minimal sketch of this kind of loop appears after this list).
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
arXiv Detail & Related papers (2023-03-30T18:30:01Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
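Relating to the Self-Refine entry above, here is a minimal, hypothetical sketch of an iterative feedback-and-refine loop in that spirit (not the authors' code); `call_llm` and the stopping phrase are assumptions standing in for a real chat-completion client and stopping criterion.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model call; returns a canned reply so the sketch runs end-to-end."""
    return "NO FURTHER ISSUES"

def self_refine(task: str, max_rounds: int = 3) -> str:
    # Initial attempt by the model.
    output = call_llm(f"Complete the following task:\n{task}")
    for _ in range(max_rounds):
        # The same model critiques its own output...
        feedback = call_llm(f"Critique this answer to the task '{task}':\n{output}")
        if "NO FURTHER ISSUES" in feedback:
            break
        # ...and then revises the output using that feedback.
        output = call_llm(
            f"Task: {task}\nPrevious answer: {output}\nFeedback: {feedback}\n"
            "Rewrite the answer, addressing the feedback."
        )
    return output

if __name__ == "__main__":
    print(self_refine("Write a one-paragraph summary of the MARG paper."))
```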
This list is automatically generated from the titles and abstracts of the papers on this site.