J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge
- URL: http://arxiv.org/abs/2505.11875v1
- Date: Sat, 17 May 2025 06:58:42 GMT
- Title: J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge
- Authors: Chi-Min Chan, Chunpu Xu, Jiaming Ji, Zhen Ye, Pengcheng Wen, Chunyang Jiang, Yaodong Yang, Wei Xue, Sirui Han, Yike Guo
- Abstract summary: This paper introduces $\textbf{J1-7B}$, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection sampling. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that $\textbf{J1-7B}$ surpasses the previous state-of-the-art LLM-as-a-Judge by $\textbf{4.8}$\% and exhibits a $\textbf{5.1}$\% stronger scaling trend under STTS.
- Score: 24.607213170485743
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, often leaving users uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning traces. In this paper, we introduce $\textbf{J1-7B}$, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection sampling and subsequently trained using Reinforcement Learning (RL) with verifiable rewards. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that $\textbf{J1-7B}$ surpasses the previous state-of-the-art LLM-as-a-Judge by $\textbf{4.8}$\% and exhibits a $\textbf{5.1}$\% stronger scaling trend under STTS. Additionally, we present three key findings: (1) existing LLM-as-a-Judge models do not inherently exhibit such a scaling trend; (2) a model simply fine-tuned on reflection-enhanced datasets continues to demonstrate similarly weak scaling behavior; (3) a significant scaling trend emerges primarily during the RL phase, suggesting that effective STTS capability is acquired predominantly through RL training.
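To make the STTS idea concrete, below is a minimal sketch of a budget-forcing-style test-time scaling loop for an LLM judge, in the spirit of the "s1" simple test-time scaling recipe that the STTS name echoes. The `generate` function, the `</think>` marker, and the `Wait` cue are illustrative assumptions, not the paper's actual interface, and the paper's exact STTS recipe may differ.

```python
# Minimal sketch of Simple Test-Time Scaling (STTS) via budget forcing:
# whenever the judge finishes its reasoning, append a reflection cue
# ("Wait") and resume decoding, so extra test-time compute goes into a
# longer thinking trace before the final verdict is emitted.
# `generate(prompt, stop=...)` is an assumed completion API.

THINK_END = "</think>"  # assumed marker closing the reasoning trace
WAIT_CUE = "\nWait"     # cue that pushes the model to keep reflecting

def judge_with_stts(generate, judge_prompt, num_extensions=2):
    """Run the judge, forcing `num_extensions` extra rounds of reflection."""
    trace = generate(judge_prompt, stop=THINK_END)
    for _ in range(num_extensions):
        # Suppress the end-of-thinking marker and nudge further reflection.
        trace += WAIT_CUE
        trace += generate(judge_prompt + trace, stop=THINK_END)
    # Close the reasoning block and decode the final judgment.
    return generate(judge_prompt + trace + THINK_END, stop=None)
```

Under this reading, the number of forced extensions is the single knob that trades compute for judgment quality: extra tokens lengthen one reasoning trace rather than sampling many independent traces, which is what makes the scaling trend in the abstract measurable.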
Related papers
- ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning [12.83211408922535]
Reinforcement learning-style post-training improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. We propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$, a simple and modular framework that generates such samples by conditioning on the ground-truth answer.
arXiv Detail & Related papers (2025-07-03T17:44:55Z)
- e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs [49.01449646799905]
We show that most existing reasoning models do not extrapolate well. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
arXiv Detail & Related papers (2025-06-10T17:52:42Z)
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
- Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory [79.63672515243765]
In this paper, we focus on a standard and realistic scaling setting: majority voting. We show that as sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We propose a method grounded in probability theory to quickly and accurately predict scaling performance and select the best strategy under large sampling times (a toy model of this effect is sketched after this list).
arXiv Detail & Related papers (2025-05-16T08:28:57Z)
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z)
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [34.806610134389366]
NoisyRollout is a simple yet effective data augmentation method that mixes trajectories from both clean and distorted images during RL training. By injecting targeted diversity in visual perception and the resulting reasoning patterns, NoisyRollout promotes better policy exploration through vision-oriented inductive biases. NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across $5$ out-of-domain reasoning and perception benchmarks.
arXiv Detail & Related papers (2025-04-17T16:10:13Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [7.78764814568908]
We first propose CLS-RL for MLLM image classification, using verifiable rewards for fine-tuning. We then rethink and question whether explicit thinking in RFT is always necessary. No-Thinking-RL explores RFT without thinking by introducing a simple equality accuracy reward.
arXiv Detail & Related papers (2025-03-20T14:37:45Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [17.432401371613903]
We propose a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge uses Monte Carlo Tree Search to decompose problems into simpler, multi-perspective evaluations. A high-precision, unit-test-level reward mechanism encourages the Large Language Model to perform line-by-line analysis.
arXiv Detail & Related papers (2025-02-18T02:55:48Z)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and to better understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
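As a toy companion to the majority-voting entry above (the sketch referenced there), here is the simplest probabilistic model of how vote-based scaling behaves. This binomial model is our own illustrative assumption, not necessarily the analysis used in that paper. If each of $k$ (odd) independent samples is correct with probability $p$ and all errors land on a single competing answer, then

$$\Pr[\text{majority vote is correct}] = \sum_{j=(k+1)/2}^{k} \binom{k}{j}\, p^{j}(1-p)^{k-j},$$

which tends to $1$ as $k$ grows whenever $p > 1/2$ and to $0$ whenever $p < 1/2$. More generally, majority voting converges to the mode of the per-sample answer distribution, so the strategy whose correct answer outweighs every individual wrong answer eventually dominates, even if its single-sample accuracy starts lower. This is one mechanism by which a complicated prompt with better initial performance can fall behind simple Chain-of-Thought as the sampling budget grows.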
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.