Related papers: Detecting LLM-Generated Peer Reviews

Detecting LLM-Generated Peer Reviews

URL: http://arxiv.org/abs/2503.15772v2
Date: Mon, 19 May 2025 01:40:25 GMT
Title: Detecting LLM-Generated Peer Reviews
Authors: Vishisht Rao, Aounon Kumar, Himabindu Lakkaraju, Nihar B. Shah,
Abstract summary: The rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently.<n>We consider the approach of performing indirect prompt injection via the paper's PDF, prompting the LLM to embed a covert watermark in the generated review.<n>We introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections.
Score: 37.51215252353345
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The integrity of peer review is fundamental to scientific progress, but the rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently. Although some venues have banned LLM-assisted reviewing, enforcement remains difficult as existing detection tools cannot reliably distinguish between fully generated reviews and those merely polished with AI assistance. In this work, we address the challenge of detecting LLM-generated reviews. We consider the approach of performing indirect prompt injection via the paper's PDF, prompting the LLM to embed a covert watermark in the generated review, and subsequently testing for presence of the watermark in the review. We identify and address several pitfalls in na\"ive implementations of this approach. Our primary contribution is a rigorous watermarking and detection framework that offers strong statistical guarantees. Specifically, we introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections such as Bonferroni, while making no assumptions about the nature of human-written reviews. We explore multiple indirect prompt injection strategies--including font-based embedding and obfuscated prompts--and evaluate their effectiveness under various reviewer defense scenarios. Our experiments find high success rates in watermark embedding across various LLMs. We also empirically find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice. In contrast, we find that Bonferroni-style corrections are too conservative to be useful in this setting.

Related papers

The Feasibility of Topic-Based Watermarking on Academic Peer Reviews [46.71493672772134]
We evaluate topic-based watermarking (TBW) for large language models (LLMs)<n>Our results show that TBW maintains review quality relative to non-watermarked outputs, while demonstrating strong to paraphrasing-based evasion.
arXiv Detail & Related papers (2025-05-27T18:09:27Z)
ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews [26.031039064337907]
Academic paper review is a critical yet time-consuming task within the research community.<n>With the increasing volume of academic publications, automating the review process has become a significant challenge.<n>We propose ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews.
arXiv Detail & Related papers (2025-03-11T14:56:58Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.<n>We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.<n>This poses a significant challenge to ensuring their safe deployment.<n>We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review.<n>The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.<n>We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review [8.606381080620789]
We investigate the ability of existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs.<n>Our analysis shows that existing approaches fail to identify many GPT-4o written reviews without also producing a high number of false positive classifications.<n>We propose a new detection approach which surpasses existing methods in the identification of GPT-4o written peer reviews at low levels of false positive classifications.
arXiv Detail & Related papers (2024-10-03T22:05:06Z)
AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons. We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs. We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution [11.418182511485032]
We evaluate comments generated by three Large Language Models (LLMs) We propose the concept of document testing, in which a document is verified by using an LLM to generate tests based on the document, running those tests, and observing whether they pass or fail.
arXiv Detail & Related papers (2024-06-21T02:40:34Z)
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs) In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrade on others. We formally define LLM's self-bias - the tendency to favor its own generation. We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs. We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.