Related papers: AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

URL: http://arxiv.org/abs/2512.17375v1
Date: Fri, 19 Dec 2025 09:22:11 GMT
Title: AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
Authors: Tung-Ling Li, Yuhao Wu, Hongliang Liu
Abstract summary: We show that short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments. We show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives.
Score: 9.127363793428119
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
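
Illustrative sketch (not from the paper): the abstract describes verdict flips caused by steering the last-layer logit gap between the "Yes" and "No" tokens. The toy example below shows only that decision rule, assuming the judge's verdict is the sign of the gap. The function names, the logit values, and the size of the shift are all hypothetical; nothing here reproduces AdvJudge-Zero's token search or any reported result.

```python
import math


def logit_gap(logit_yes: float, logit_no: float) -> float:
    """Gap between the last-layer logits of the "Yes" and "No" tokens."""
    return logit_yes - logit_no


def judge_decision(logit_yes: float, logit_no: float) -> str:
    """Binary verdict: "Yes" when the gap is positive, otherwise "No"."""
    return "Yes" if logit_gap(logit_yes, logit_no) > 0 else "No"


def p_yes(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to the two verdict tokens, i.e. sigmoid of the gap."""
    return 1.0 / (1.0 + math.exp(-logit_gap(logit_yes, logit_no)))


# Hypothetical logits for an incorrect answer that the judge rejects.
clean = (1.0, 3.0)  # gap = -2.0, so the verdict is "No"

# Hypothetical shift of the "Yes" logit, standing in for the effect of
# appended control tokens (the value is made up for illustration).
shift = 3.5
attacked = (clean[0] + shift, clean[1])  # gap = +1.5, so the verdict is "Yes"

print(judge_decision(*clean), round(p_yes(*clean), 3))
print(judge_decision(*attacked), round(p_yes(*attacked), 3))
```

In this toy setting, any perturbation that moves the gap across zero turns a correct rejection into a false positive, which is why the gap is a natural quantity to monitor.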

Related papers

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z)
One Token Embedding Is Enough to Deadlock Your Large Reasoning Model [91.48868589442837]
We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow. Our method achieves a 100% attack success rate across four advanced LRMs.
arXiv Detail & Related papers (2025-10-12T07:42:57Z)
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs).
arXiv Detail & Related papers (2025-10-01T13:56:44Z)
Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification [28.63435151584449]
We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses.
arXiv Detail & Related papers (2025-09-26T02:21:12Z)
One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models. We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z)
But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors [0.0]
Judge Using Safety-Steered Alternatives (JUSSA) is a framework that employs steering vectors during inference to generate more honest alternatives. We evaluate JUSSA on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
arXiv Detail & Related papers (2025-05-23T11:34:02Z)
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.