The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors
- URL: http://arxiv.org/abs/2509.04484v3
- Date: Mon, 22 Sep 2025 08:57:11 GMT
- Title: The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors
- Authors: Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, Ted Briscoe
- Abstract summary: We identify four key aspects of review comments that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. We benchmark fine-tuned models for assessing review comments on these aspects and generating rationales.
- Score: 45.98233565214142
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Providing constructive feedback to paper authors is a core component of peer review. As reviewers have increasingly little time to perform reviews, automated support systems are needed to maintain high reviewing quality and thus keep review feedback useful for authors. To this end, we identify four key aspects of review comments (individual points in the weakness sections of reviews) that drive their utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable the evaluation and development of models that assess review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.
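To make the evaluation setup concrete, below is a minimal Python sketch of what a RevUtil-style scoring and agreement check might look like. The prompt wording, the 1-5 scale, the generic `llm` callable, and the use of quadratic-weighted Cohen's kappa are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: score review comments on the four utility aspects with
# a model, then measure agreement with human labels. Prompt format, score
# scale, and kappa weighting are assumptions -- the abstract does not give them.
from sklearn.metrics import cohen_kappa_score

ASPECTS = ["Actionability", "Grounding & Specificity", "Verifiability", "Helpfulness"]

def build_prompt(comment: str, aspect: str) -> str:
    # Ask for a score plus a short justification, mirroring the dataset's rationales.
    return (
        f"Rate the following peer-review comment for {aspect} on a 1-5 scale, "
        f"then briefly justify the score.\n\nComment: {comment}\n\nScore:"
    )

def score_comment(llm, comment: str) -> dict:
    """`llm` is any callable str -> str (e.g., a fine-tuned checkpoint wrapper)."""
    scores = {}
    for aspect in ASPECTS:
        reply = llm(build_prompt(comment, aspect))
        scores[aspect] = int(reply.strip().split()[0])  # naive parse, fine for a sketch
    return scores

def human_model_agreement(human: list, model: list) -> float:
    # Quadratic-weighted kappa is a common agreement statistic for ordinal scores.
    return cohen_kappa_score(human, model, weights="quadratic")
```

In practice the prompting and parsing would need to be more robust; the point is only the shape of the pipeline: per-aspect scores with rationales, compared against human labels via an agreement statistic.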
Related papers
- EchoReview: Learning Peer Review from the Echoes of Scientific Citations [48.852960317704486]
EchoReview is a citation-context-driven data synthesis framework. It transforms the scientific community's long-term judgments into structured review-style data. It can achieve significant and stable improvements on core review dimensions such as evidence support and review comprehensiveness.
arXiv Detail & Related papers (2026-01-31T13:55:38Z) - Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review [53.99984738447279]
Recent work frames this task as automatic text generation, underusing author expertise and intent. We introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement. To support this formulation, we construct Re$3$Align, the first large-scale dataset of aligned review-response-revision triplets.
arXiv Detail & Related papers (2026-01-19T14:07:10Z) - ReviewScore: Misinformed Peer Review Detection with Large Language Models [38.92827930465428]
We show that 15.2% of weaknesses and 26.4% of questions are misinformed, and introduce ReviewScore, which indicates whether a review point is misinformed. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality.
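The premise-level idea lends itself to a simple decomposition pattern. Below is a hypothetical sketch; the `decompose` and `judge` callables stand in for LLM calls, and their prompts and outputs are assumptions rather than the paper's protocol.

```python
# Hypothetical sketch of premise-level factuality checking: split a weakness
# into atomic premises, judge each against the paper, and flag the weakness
# as misinformed if any premise is false.
from typing import Callable, List

def weakness_is_misinformed(
    weakness: str,
    paper_text: str,
    decompose: Callable[[str], List[str]],   # weakness -> atomic premises
    judge: Callable[[str, str], bool],       # (premise, paper) -> factual?
) -> bool:
    premises = decompose(weakness)
    # Per the abstract, judging premises individually agrees with human
    # experts more than judging the whole weakness at once.
    return any(not judge(premise, paper_text) for premise in premises)
```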
arXiv Detail & Related papers (2025-09-25T22:55:05Z) - LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews [74.87393214734114]
This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy-thinking categories. Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. Instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points.
arXiv Detail & Related papers (2025-04-15T10:07:33Z) - Identifying Aspects in Peer Reviews [59.02879434536289]
We develop a data-driven schema for deriving aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis.
arXiv Detail & Related papers (2025-04-09T14:14:42Z) - Generative Adversarial Reviews: When LLMs Become the Critic [1.2430809884830318]
We introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. Central to this approach is a graph-based representation of manuscripts, condensing content and logically organizing information. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes.
arXiv Detail & Related papers (2024-12-09T06:58:17Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [55.33653554387953]
Pattern Analysis and Machine Intelligence (PAMI) has led to numerous literature reviews aimed at collecting and organizing fragmented information. This paper presents a thorough analysis of these literature reviews within the PAMI field. We address three core research questions: (1) What are the prevalent structural and statistical characteristics of PAMI literature reviews; (2) What strategies can researchers employ to efficiently navigate the growing corpus of reviews; and (3) What are the advantages and limitations of AI-generated reviews compared to human-authored ones.
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - ReAct: A Review Comment Dataset for Actionability (and more) [0.8885727065823155]
We introduce ReAct, an annotated review-comment dataset.
The review comments are sourced from the OpenReview site.
We crowd-source annotations for these comments, covering actionability and comment type.
arXiv Detail & Related papers (2022-10-02T07:09:38Z) - On Faithfulness and Coherence of Language Explanations for Recommendation Systems [8.143715142450876]
This work probes state-of-the-art models and their review generation component.
We show that the generated explanations are brittle and need further evaluation before being taken as literal rationales for the estimated ratings.
arXiv Detail & Related papers (2022-09-12T17:00:31Z) - User and Item-aware Estimation of Review Helpfulness [4.640835690336653]
We investigate the role of deviations in review properties as determinants of helpfulness.
We propose a novel helpfulness estimation model that extends previous ones.
Our model is thus an effective tool to select relevant user feedback for decision-making.
arXiv Detail & Related papers (2020-05-25T16:30:05Z) - How Useful are Reviews for Recommendation? A Critical Review and Potential Improvements [8.471274313213092]
We investigate a growing body of work that seeks to improve recommender systems through the use of review text.
Our initial findings reveal several discrepancies in reported results, partly due to copying results across papers despite changes in experimental settings or data pre-processing.
Further investigation raises a much larger question about the "importance" of user reviews for recommendation.
arXiv Detail & Related papers (2020-05-25T16:30:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.