MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation
- URL: http://arxiv.org/abs/2502.12468v1
- Date: Tue, 18 Feb 2025 02:55:48 GMT
- Title: MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation
- Authors: Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti
- Abstract summary: We propose a resource-efficient, System-2 thinking framework for code correctness evaluation.
MCTS-Judge uses Monte Carlo Tree Search to decompose problems into simpler, multi-perspective evaluations.
A high-precision, unit-test-level reward mechanism encourages the Large Language Model to perform line-by-line analysis.
- Score: 17.432401371613903
- Abstract: The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
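The abstract names two ingredients of node selection (a UCT term computed from prior rollouts and a self-assessment of the current trajectory) without giving the exact rule. The sketch below is a minimal illustration of how such a blend could look; the `Node` fields, the mixing weight `alpha`, and the exploration constant are assumptions rather than the paper's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One partial evaluation trajectory in the search tree."""
    visits: int = 0
    total_reward: float = 0.0
    self_assessment: float = 0.5  # hypothetical LLM self-score for this trajectory, in [0, 1]
    children: list = field(default_factory=list)

def uct(child: Node, parent_visits: int, c: float = 1.41) -> float:
    """Standard Upper Confidence Bound for Trees, driven by prior rollouts."""
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(max(1, parent_visits)) / child.visits)
    return exploit + explore

def select_child(parent: Node, alpha: float = 0.5) -> Node:
    """Blend global UCT statistics with the model's own assessment of the
    current trajectory; alpha is an assumed mixing weight."""
    return max(
        parent.children,
        key=lambda ch: alpha * uct(ch, parent.visits) + (1 - alpha) * ch.self_assessment,
    )
```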
Related papers
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs).
RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.
RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
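The summary describes RSD as accepting draft outputs with a controlled bias toward high reward rather than enforcing unbiasedness. The loop below is only a shape sketch of that idea; `draft_step`, `target_step`, `reward_fn`, and the acceptance threshold are placeholders, not the paper's actual procedure.

```python
def reward_guided_decode(prompt, draft_step, target_step, reward_fn,
                         threshold=0.7, max_steps=64):
    """Illustrative reward-guided speculative decoding loop: keep cheap draft
    continuations whose reward clears a threshold, otherwise fall back to the
    target model. All callables and the threshold are assumptions."""
    output = prompt
    for _ in range(max_steps):
        candidate = draft_step(output)      # cheap draft continuation
        if reward_fn(output, candidate) >= threshold:
            output += candidate             # accept high-reward draft (controlled bias)
        else:
            output += target_step(output)   # fall back to the large target model
        if output.endswith("<eos>"):
            break
    return output
```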
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [78.28188747489769]
We propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge.
In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions.
Our method achieves a new state-of-the-art performance for generative reward models on RewardBench.
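The plan-then-execute evaluation flow can be sketched as below; `propose_plan`, `execute_plan`, and `render_verdict` are hypothetical LLM-backed calls, and the self-training / preference-optimization loop itself is not modeled.

```python
def plan_then_judge(instruction, response_a, response_b,
                    propose_plan, execute_plan, render_verdict):
    """Illustrative plan-then-execute judging pass: draft an evaluation plan,
    carry it out step by step, then emit a preference verdict. The callables
    are hypothetical; EvalPlanner additionally optimizes plans and executions
    with preference learning, which is not shown here."""
    plan = propose_plan(instruction)                     # e.g. a list of evaluation criteria
    execution = [execute_plan(step, instruction, response_a, response_b)
                 for step in plan]
    return render_verdict(instruction, plan, execution)  # e.g. "A", "B", or "tie"
```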
arXiv Detail & Related papers (2025-01-30T02:21:59Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks.
However, a significant discrepancy persists between benchmark performances and real-world applications.
We introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance.
We present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems.
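The summary does not define G-Pass@k; a common stability-oriented reading is the probability that at least ⌈τ·k⌉ of k generations, drawn without replacement from n samples of which c are correct, are themselves correct. The sketch below implements that reading and should be treated as an assumption about the metric, not its official definition.

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Hedged sketch of a G-Pass@k-style metric (assumes k <= n and c <= n):
    probability that at least ceil(tau * k) of k draws from n samples,
    c of which are correct, are correct (hypergeometric tail)."""
    need = ceil(tau * k)
    total = comb(n, k)
    hits = sum(comb(c, j) * comb(n - c, k - j)
               for j in range(need, min(c, k) + 1))
    return hits / total

# Example: 16 samples, 10 correct, require all 4 of 4 draws to be correct
print(g_pass_at_k(n=16, c=10, k=4, tau=1.0))
```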
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning [13.082135438792475]
Chain of Self-Correction (CoSC) embeds self-correction as an inherent ability in Large Language Models.
CoSC operates through a sequence of self-correction stages.
Experiments show that CoSC significantly boosts performance on standard mathematical datasets.
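The staged self-correction can be pictured as a generic generate-verify-revise loop; the sketch below is only that generic picture under hypothetical callables, not the paper's concrete stages.

```python
def self_correct(problem, generate, verify, revise, max_rounds=3):
    """Generic generate-verify-revise loop illustrating staged self-correction.
    `generate`, `verify` (-> (ok, feedback)), and `revise` are hypothetical
    LLM-backed callables; CoSC's actual stages may differ."""
    solution = generate(problem)
    for _ in range(max_rounds):
        ok, feedback = verify(problem, solution)
        if ok:
            break
        solution = revise(problem, solution, feedback)
    return solution
```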
arXiv Detail & Related papers (2024-10-14T17:16:44Z) - LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The LLaMA-Berry framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z) - Interpretable Contrastive Monte Carlo Tree Search Reasoning [25.11379135302235]
We propose SC-MCTS*, a novel Monte Carlo Tree Search (MCTS) reasoning algorithm for Large Language Models (LLMs).
We show that SC-MCTS* significantly improves both reasoning accuracy and speed.
We outperformed o1-mini by an average of 17.4% on the Blocksworld multi-step reasoning dataset using Llama-3.1-70B with SC-MCTS*.
arXiv Detail & Related papers (2024-10-02T16:15:31Z) - Reasoning Aware Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling [9.44858963874474]
Self-Consistency mitigates hallucinations in Large Language Models (LLMs) by sampling multiple reasoning paths.
We introduce Reasoning-Aware Self-Consistency (RASC), a novel framework that enhances sampling efficiency and reasoning faithfulness.
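The RASC summary states only the goal (more efficient sampling with faithful reasoning); the loop below is a generic illustration of weighted, early-stopping self-consistency under assumed callables `generate_path` and `score_path` and an assumed stopping threshold.

```python
from collections import Counter

def rasc_style_sample(generate_path, score_path, max_samples=20, stop_weight=3.0):
    """Illustrative reasoning-aware self-consistency loop: draw reasoning paths
    one at a time, weight each answer by a quality score for its path, and stop
    early once the leading answer has enough weighted support. The callables
    and the stopping threshold are assumptions, not the paper's procedure."""
    weights = Counter()
    for _ in range(max_samples):
        answer, reasoning = generate_path()          # hypothetical: (final answer, its reasoning)
        weights[answer] += score_path(reasoning)     # hypothetical path-quality score in [0, 1]
        best, support = weights.most_common(1)[0]
        if support >= stop_weight:
            return best                              # early exit with a well-supported answer
    return weights.most_common(1)[0][0]              # fall back to weighted majority vote
```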
arXiv Detail & Related papers (2024-08-30T05:14:59Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
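As a rough illustration of recasting self-evaluation as token-level prediction, the sketch below scores an answer by the probability mass the model assigns to a single "correct" verdict token; the prompt format, verdict tokens, and the `token_logprobs` interface are assumptions.

```python
import math

def self_eval_score(token_logprobs: dict[str, float]) -> float:
    """Hedged sketch of token-level self-evaluation: given log-probabilities the
    model assigns to single-token verdicts after a prompt like "Is the proposed
    answer correct? (A) Yes (B) No", return the normalized probability of "A".
    The verdict tokens and prompt format are illustrative assumptions."""
    p_yes = math.exp(token_logprobs.get("A", float("-inf")))
    p_no = math.exp(token_logprobs.get("B", float("-inf")))
    return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.0

# Example with next-token log-probabilities returned by an API
print(self_eval_score({"A": -0.2, "B": -1.8}))  # ~0.83
```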
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.