Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence
- URL: http://arxiv.org/abs/2601.04766v1
- Date: Thu, 08 Jan 2026 09:34:54 GMT
- Title: Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence
- Authors: Shengyin Sun, Yiming Li, Renxi Liu, Weizhe Lin, Hui-Ling Zhen, Xianzhi Yu, Mingxuan Yuan, Chen Ma,
- Abstract summary: Judge Decoding accelerates inference by relaxing the strict verification of Speculative Decoding. In this work, we revisit this paradigm from first principles, revealing that the ``criticality'' scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence.
- Score: 31.435770434219005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the ``criticality'' scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
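The abstract's core idea can be illustrated with a minimal sketch: accept a draft token when the KL divergence between the target and draft next-token distributions stays below a threshold. This is an illustrative reconstruction only; the function names, the divergence direction, and the threshold value are assumptions, not the paper's exact criterion.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary, with a small epsilon for stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def accept_draft_token(draft_probs, target_probs, threshold=0.5):
    """Training-free verification sketch: accept the draft token when the
    draft-target divergence is small (threshold is illustrative)."""
    return kl_divergence(target_probs, draft_probs) <= threshold
```

In an actual speculative-decoding loop, `draft_probs` and `target_probs` would be the softmaxed logits of the draft and target models at the same position; the point of the paper is that this divergence alone can substitute for a trained judge's criticality score.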
Related papers
- JAF: Judge Agent Forest [8.150475950851359]
We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query--response pairs. We develop a flexible locality-sensitive hashing algorithm that learns informative binary codes by integrating semantic embeddings. We validate JAF with an empirical study on the demanding task of cloud misconfiguration triage in large-scale cloud environments.
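For intuition, the classic data-independent form of locality-sensitive hashing maps embeddings to binary codes via the sign of random projections; JAF's learned variant differs, so the sketch below is only a baseline illustration under assumed names.

```python
import random

def lsh_codes(embeddings, n_bits=8, seed=0):
    """Hash dense embeddings to binary codes with random hyperplanes:
    one bit per hyperplane, set by the sign of the dot product."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    codes = []
    for v in embeddings:
        bits = tuple(int(sum(w * x for w, x in zip(plane, v)) >= 0)
                     for plane in planes)
        codes.append(bits)
    return codes
```

Nearby embeddings collide in most bits, so grouping by code buckets similar query--response pairs for joint judging.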
arXiv Detail & Related papers (2026-01-29T19:42:42Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior [58.751981587234916]
This paper formalizes the audit task of Refinement Provenance Inference (RPI). We propose RePro, a logit-based framework that fuses teacher-forced likelihood features with logit-ranking signals. During training, RePro learns a transferable representation via shadow fine-tuning, and uses a lightweight linear head to infer provenance on unseen victims without training-data access.
arXiv Detail & Related papers (2026-01-05T10:16:41Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection. Our framework formulates unfaithfulness detection as a discriminative decision problem. FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z) - SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification [28.63435151584449]
We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses.
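A toy illustration of this kind of self-supervision signal: label a token substitution acceptable when the two responses' embeddings stay close. The embedding function, the `cosine`/`label_substitution` helper names, and the threshold `tau` are all hypothetical, not SelfJudge's actual procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def label_substitution(embed_fn, original, substituted, tau=0.9):
    """Self-supervised label sketch: the substitution 'preserves meaning'
    when the embeddings of the two responses remain close (cosine >= tau)."""
    return cosine(embed_fn(original), embed_fn(substituted)) >= tau
```

Labels produced this way could then supervise a lightweight judge without any human annotation, which is the spirit of the approach.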
arXiv Detail & Related papers (2025-09-26T02:21:12Z) - ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs [21.409155842171497]
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs). Errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. We introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method to address this specific vulnerability.
arXiv Detail & Related papers (2025-08-07T11:26:40Z) - Feedback Guidance of Diffusion Models [14.162420300295365]
Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need.
arXiv Detail & Related papers (2025-06-06T13:46:32Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Zero-Shot Verification-guided Chain of Thoughts [64.862738244735]
We focus on self-verification of self-generated reasoning steps via CoT prompts in a completely zero-shot regime. To explore this setting, we design a new zero-shot prompt, which we call COT STEP, to aid zero-shot decomposition of reasoning steps. We evaluate the verifiers' ability to classify the correctness of reasoning chains and explore different ways to use verifier scores in guiding reasoning.
arXiv Detail & Related papers (2025-01-21T03:52:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.