When AI reviews science: Can we trust the referee?
Abstract Overview
This paper provides a security- and reliability-centered analysis of AI peer review by mapping attack surfaces across the full review lifecycle—training and data retrieval, desk review, deep review, rebuttal, and system-level vulnerabilities. The authors instantiate this threat taxonomy with four controlled treatment-control probes on 100 stratified ICLR 2025 submissions, evaluated by Gemini 2.5 and GPT 5.1 as AI referees. The experiments demonstrate that review scores can be shifted by prestige cues, rhetorical style, evidence-free rebuttals, and biased contextual information, indicating that current AI referees are sensitive to non-merit factors. The paper also proposes stage-specific defense strategies and outlines future research directions for trustworthy AI peer review.
Novelty
The distinctive contribution is a lifecycle-wide threat model for AI peer review combined with quantitative causal-style probes tied to specific review stages. Rather than analyzing only prompt injection or generic LLM-as-judge fragility, the paper evaluates multiple representative attack vectors—prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning—on real conference submissions with two advanced reviewer models.
Results
The experiments show measurable score distortions. High-prestige framing raised scores by +0.21 to +0.29, while low-prestige framing caused larger decreases of -0.59 to -0.85. Cautious language was penalized by -0.26 to -0.52 relative to the originals, and confident but evidence-free rebuttals raised scores by +0.42 to +0.65 across both models. Contextual poisoning effects varied by model: Gemini 2.5 was susceptible to positive framing (+0.16) and GPT 5.1 to negative framing (-0.31). Together, these results indicate that AI referees' evaluations are permeable to manipulated information environments.
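Score shifts of this kind are typically computed as paired differences, with the same paper scored once in its original form and once under the manipulation. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name `paired_shift` and the example scores are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_shift(control, treatment):
    """Return (mean shift, standard error) of treatment minus control scores.

    Both lists must hold one score per paper, in the same paper order.
    """
    if len(control) != len(treatment):
        raise ValueError("control and treatment must cover the same papers")
    if len(control) < 2:
        raise ValueError("need at least two papers to estimate a standard error")
    # Per-paper difference isolates the effect of the manipulation.
    diffs = [t - c for c, t in zip(control, treatment)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


# Hypothetical scores for five papers (illustrative only, not data from the paper).
control_scores = [6.0, 5.0, 7.0, 4.0, 6.0]
treatment_scores = [6.5, 5.5, 7.0, 5.0, 6.5]

shift, stderr = paired_shift(control_scores, treatment_scores)
print(f"mean shift: {shift:+.2f} (standard error {stderr:.2f})")
```

Pairing by paper removes between-paper variation in quality, so a shift reflects the manipulated cue rather than differences among submissions.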
Key Points
- The paper proposes an end-to-end taxonomy of attacks on AI peer review spanning training/data retrieval, desk review, deep review, rebuttal, and system-level vulnerabilities, with analysis of mechanisms, attacker prerequisites, concealment, and difficulty.
- On 100 ICLR 2025 papers, both Gemini 2.5 and GPT 5.1 exhibited authority bias (asymmetric, with low-prestige penalties exceeding high-prestige boosts), a systematic penalty for cautious scientific language, and significant upward score shifts after assertive evidence-free rebuttals.
- Biased retrieved context altered review scores in a model-dependent manner: Gemini 2.5 raised scores under positive framing, while GPT 5.1 lowered them under negative framing. This demonstrates that retrieval or knowledge-base contamination can subtly influence AI-assisted scientific evaluation.
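The asymmetry in authority bias can be made concrete with the ranges reported in the Results section. The sketch below bounds the ratio of the low-prestige penalty to the high-prestige boost. It assumes the reported ranges span the two models, and the helper name `ratio_bounds` is hypothetical.

```python
# Score-shift ranges as reported in the summary (assumed to span the two models).
high_prestige_boost = (0.21, 0.29)    # smallest and largest increase
low_prestige_penalty = (0.59, 0.85)   # smallest and largest decrease, as magnitudes


def ratio_bounds(penalty, boost):
    """Return the (lowest, highest) penalty-to-boost ratio consistent with two ranges."""
    return min(penalty) / max(boost), max(penalty) / min(boost)


low, high = ratio_bounds(low_prestige_penalty, high_prestige_boost)
print(f"penalty / boost ratio lies between {low:.2f} and {high:.2f}")
```

Even the most conservative pairing (smallest penalty against largest boost) puts the low-prestige penalty at roughly twice the high-prestige boost.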