Related papers: LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews

URL: http://arxiv.org/abs/2504.11042v1
Date: Tue, 15 Apr 2025 10:07:33 GMT
Title: LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
Authors: Sukannya Purkayastha, Zhuang Li, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Abstract summary: This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. Instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points.
Score: 74.87393214734114
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: https://github.com/UKPLab/arxiv2025-lazy-review)

Related papers

Re$^2$: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions [2.5226834810382113]
We introduce the largest consistency-ensured peer review and rebuttal dataset named Re2. This dataset comprises 19,926 initial submissions, 70,668 review comments, and 53,818 rebuttals from 24 conferences and 21 workshops on OpenReview.
arXiv Detail & Related papers (2025-05-12T16:02:52Z)
Identifying Aspects in Peer Reviews [61.374437855024844]
We develop a data-driven schema for deriving fine-grained aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis.
arXiv Detail & Related papers (2025-04-09T14:14:42Z)
Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation [2.990411348977783]
Open-source datasets are used to train neural models for automated code review tasks. These datasets contain a significant amount of noisy comments that persist despite cleaning methods. We propose a novel approach using large language models (LLMs) to further clean these datasets.
arXiv Detail & Related papers (2025-02-04T22:48:58Z)
Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders [6.538051328482194]
We build upon RevRecV1, the recommender that has been in production since 2018. We find that reviewers were being assigned based on prior authorship of files. Having an individual who is responsible for the review reduces the time taken for reviews by 11%.
arXiv Detail & Related papers (2023-12-28T17:55:13Z)
Can GitHub Issues Help in App Review Classifications? [0.7366405857677226]
We propose a novel approach that assists in augmenting labeled datasets by utilizing information extracted from GitHub issues. Our results demonstrate that using labeled issues for data augmentation can improve the F1-score by 6.3 in bug reports and by 7.2 in feature requests.
arXiv Detail & Related papers (2023-08-27T22:01:24Z)
Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples. We show that a few-shot method based on adapters can easily store in-domain knowledge. We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
arXiv Detail & Related papers (2022-05-04T16:38:37Z)
Learning Opinion Summarizers by Selecting Informative Reviews [81.47506952645564]
We collect a large dataset of summaries paired with user reviews for over 31,000 products, enabling supervised training. The content of many reviews is not reflected in the human-written summaries, and, thus, the summarizer trained on random review subsets hallucinates. We formulate the task as jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets.
arXiv Detail & Related papers (2021-09-09T15:01:43Z)
Polarity in the Classroom: A Case Study Leveraging Peer Sentiment Toward Scalable Assessment [4.588028371034406]
Accurately grading open-ended assignments in large or massive open online courses (MOOCs) is non-trivial. In this work, we detail the process by which we create our domain-dependent lexicon and aspect-informed review form. We end by analyzing validity and discussing conclusions from our corpus of over 6800 peer reviews from nine courses.
arXiv Detail & Related papers (2021-08-02T15:45:11Z)
Can We Automate Scientific Reviewing? [89.50052670307434]
We discuss the possibility of using state-of-the-art natural language processing (NLP) models to generate first-pass peer reviews for scientific papers. We collect a dataset of papers in the machine learning domain, annotate them with different aspects of content covered in each review, and train targeted summarization models that take in papers to generate reviews. Comprehensive experimental results show that system-generated reviews tend to touch upon more aspects of the paper than human-written reviews.
arXiv Detail & Related papers (2021-01-30T07:16:53Z)
How Useful are Reviews for Recommendation? A Critical Review and Potential Improvements [8.471274313213092]
We investigate a growing body of work that seeks to improve recommender systems through the use of review text. Our initial findings reveal several discrepancies in reported results, partly due to copying results across papers despite changes in experimental settings or data pre-processing. Further investigation calls for discussion on a much larger problem about the "importance" of user reviews for recommendation.
arXiv Detail & Related papers (2020-05-25T16:30:05Z)
Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof. At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.