FuguReport

When AI reviews science: Can we trust the referee?

Authors Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, Lei Chen
Affiliations Southeast University / The Hong Kong University of Science and Technology / Zhejiang University
Categories Evaluation / Peer Review Evaluation / Trustworthiness of AI referees, Method / Causal Analysis / Separating framing and contextual bias effects, Application / Scientific Review / AI-assisted science review lifecycle
License CC BY 4.0

Abstract Overview

This paper provides a security- and reliability-centered analysis of AI peer review by mapping attack surfaces across the full review lifecycle—training and data retrieval, desk review, deep review, rebuttal, and system-level vulnerabilities. The authors instantiate this threat taxonomy with four controlled treatment-control probes on 100 stratified ICLR 2025 submissions, evaluated by Gemini 2.5 and GPT 5.1 as AI referees. The experiments demonstrate that review scores can be shifted by prestige cues, rhetorical style, evidence-free rebuttals, and biased contextual information, indicating that current AI referees are sensitive to non-merit factors. The paper also proposes stage-specific defense strategies and outlines future research directions for trustworthy AI peer review.
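The treatment-control design described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the summary does not give the actual prompts or cue wording, so `ProbePair`, `make_prestige_probe`, and the example cue string are hypothetical, and prepending the cue to the paper text is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbePair:
    """One paper rendered twice: unmodified (control) and with one cue (treatment)."""
    paper_id: str
    control: str
    treatment: str


def make_prestige_probe(paper_id: str, paper_text: str, cue: str) -> ProbePair:
    # Assumption: the cue is a single line placed before the paper text.
    # Everything else is held constant so any score change is attributable to the cue.
    return ProbePair(paper_id, paper_text, f"{cue}\n\n{paper_text}")


def differs_only_by_cue(pair: ProbePair, cue: str) -> bool:
    # Sanity check: the treatment must be the control plus the cue, nothing more.
    if not pair.treatment.endswith(pair.control):
        return False
    return pair.treatment.removesuffix(pair.control).strip() == cue


# Hypothetical cue text, for illustration only.
cue = "Affiliation: a highly ranked research institution"
pair = make_prestige_probe("paper-001", "Title: ...\nAbstract: ...", cue)
is_clean = differs_only_by_cue(pair, cue)
```

Each referee model would then score `pair.control` and `pair.treatment` separately, and the per-paper difference is the quantity of interest.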

Novelty

The distinctive contribution is a lifecycle-wide threat model for AI peer review combined with quantitative causal-style probes tied to specific review stages. Rather than analyzing only prompt injection or generic LLM-as-judge fragility, the paper evaluates multiple representative attack vectors—prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning—on real conference submissions with two advanced reviewer models.

Results

The experiments show measurable score distortions across the four probes. High-prestige framing increased scores by +0.21 to +0.29, while low-prestige framing caused larger decreases of -0.59 to -0.85. Cautious language was penalized by -0.26 to -0.52 relative to the original text. Evidence-free but confident rebuttals raised scores by +0.42 to +0.65 across both models. Contextual poisoning effects varied by model: Gemini 2.5 was susceptible to positive framing (+0.16) and GPT 5.1 to negative framing (-0.31), indicating that AI referees' scores can be shifted by manipulating the information they retrieve.
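A score shift of this kind can be computed as a mean paired difference between treatment and control scores for the same papers. The sketch below is one plausible way to do it, with a percentile bootstrap interval added for uncertainty. The summary does not state which statistical procedure the authors used, and the scores here are toy values, not data from the paper.

```python
import random
from statistics import mean


def paired_shift(control: list[float], treatment: list[float]) -> float:
    """Mean per-paper score difference (treatment minus control)."""
    if len(control) != len(treatment):
        raise ValueError("control and treatment must cover the same papers")
    return mean(t - c for c, t in zip(control, treatment))


def bootstrap_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [t - c for c, t in zip(control, treatment)]
    # Resample papers with replacement and recompute the mean each time.
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy scores for five papers (illustrative only).
control = [6.0, 5.0, 7.0, 4.0, 6.0]
treatment = [6.5, 5.0, 7.5, 4.5, 6.0]

shift = paired_shift(control, treatment)
lo, hi = bootstrap_ci(control, treatment)
```

Pairing by paper removes between-paper quality differences from the comparison, which is why the same submission is scored under both conditions.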

Key Points

  1. The paper proposes an end-to-end taxonomy of attacks on AI peer review spanning training/data retrieval, desk review, deep review, rebuttal, and system-level vulnerabilities, with analysis of mechanisms, attacker prerequisites, concealment, and difficulty.
  2. On 100 ICLR 2025 papers, both Gemini 2.5 and GPT 5.1 exhibited authority bias (asymmetric, with low-prestige penalties exceeding high-prestige boosts), a systematic penalty for cautious scientific language, and significant upward score shifts after assertive evidence-free rebuttals.
  3. Biased retrieved context altered review scores in a model-dependent manner: Gemini 2.5 raised scores under positive framing, while GPT 5.1 lowered scores under negative framing. This demonstrates that retrieval or knowledge-base contamination can subtly influence AI-assisted scientific evaluation.
