Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
- URL: http://arxiv.org/abs/2510.13744v1
- Date: Wed, 15 Oct 2025 16:50:54 GMT
- Title: Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
- Authors: Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty
- Abstract summary: We introduce Hard2Verify, a step-level verification benchmark produced with over 500 hours of human labor. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed-source models.
- Score: 80.46254366870447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: Verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed-source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.
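The evaluation task described in the abstract can be stated concretely: given a response split into steps and human step-level labels, a verifier is scored on its per-step verdicts and on whether it pinpoints the same first error as the annotators. The Python sketch below illustrates this under assumed, hypothetical data structures and metric names (`StepVerification`, `step_accuracy`, `first_error_match`); the paper's official schema and metrics may differ.

```python
# Minimal sketch of step-level verification scoring, assuming a
# hypothetical record format; Hard2Verify's actual schema and official
# metrics may differ.
from dataclasses import dataclass


@dataclass
class StepVerification:
    step_labels: list[bool]  # per-step verdicts, True = step is correct


def first_error(labels: list[bool]) -> int | None:
    """Index of the first incorrect step, or None if all steps pass."""
    for i, ok in enumerate(labels):
        if not ok:
            return i
    return None


def score(pred: StepVerification, gold: StepVerification) -> dict[str, float]:
    """Two illustrative metrics: per-step label accuracy, and whether the
    verifier pinpoints the same first error as the human annotators."""
    assert len(pred.step_labels) == len(gold.step_labels)
    n = len(gold.step_labels)
    step_acc = sum(p == g for p, g in zip(pred.step_labels, gold.step_labels)) / n
    first_err_match = float(first_error(pred.step_labels) == first_error(gold.step_labels))
    return {"step_accuracy": step_acc, "first_error_match": first_err_match}


# Example: the verifier flags step 2 as the first error, humans flag step 3.
pred = StepVerification([True, True, False, False])
gold = StepVerification([True, True, True, False])
print(score(pred, gold))  # {'step_accuracy': 0.75, 'first_error_match': 0.0}
```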
Related papers
- An Empirical Study of Reasoning Steps in Thinking Code LLMs [8.653365851909745]
Thinking Large Language Models generate explicit intermediate reasoning traces before final answers. This study examines the reasoning process and quality of thinking LLMs for code generation.
arXiv Detail & Related papers (2025-11-08T06:18:48Z)
- Verification Limits Code LLM Training [23.67882363039948]
Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. In this work, we study how verification design and strategies influence model performance.
arXiv Detail & Related papers (2025-09-25T07:23:30Z)
- Variation in Verification: Understanding Verification Dynamics in Large Language Models [43.829778623942275]
We study generative verifiers, which perform verification by generating chain-of-thought reasoning followed by a binary verdict. Our experiments reveal three key findings about verification effectiveness. (A minimal sketch of this reason-then-verdict pattern appears after this list.)
arXiv Detail & Related papers (2025-09-22T16:36:56Z)
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluating large language models' reasoning capabilities. We introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z)
- Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information [14.071887353084126]
Chain-of-Thought (CoT) has become a vital technique for enhancing the performance of Large Language Models (LLMs).
We propose Wrong-of-Thought (WoT), which includes two core modules.
Experiments on 8 popular datasets and 5 LLMs demonstrate that WoT surpasses all previous baselines.
arXiv Detail & Related papers (2024-10-06T12:27:21Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
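As referenced in the "Variation in Verification" entry above, a generative verifier produces chain-of-thought reasoning followed by a binary verdict, which the harness then parses. The sketch below is a hedged illustration of that pattern only: `call_llm` is a hypothetical stand-in for any model client, and the prompt wording and "Verdict:" output format are assumptions, not that paper's actual protocol.

```python
# Minimal sketch of a generative verifier: the model reasons step by
# step, then emits a binary verdict that we parse. `call_llm` is a
# hypothetical stand-in for a chat-completion client; the prompt and
# verdict format are assumptions.
import re

VERIFY_PROMPT = """You are a math proof verifier.
Problem: {problem}
Candidate step: {step}
Think through whether this step is correct and sufficiently supported,
then end with exactly one line: "Verdict: correct" or "Verdict: incorrect".
"""


def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your model client here.
    raise NotImplementedError


def verify_step(problem: str, step: str) -> bool | None:
    """Returns True/False for a parsed verdict, or None if the model's
    reply contains no recognizable verdict line."""
    reply = call_llm(VERIFY_PROMPT.format(problem=problem, step=step))
    match = re.search(r"Verdict:\s*(correct|incorrect)", reply, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "correct"
```

Returning `None` on an unparseable reply, rather than defaulting to a verdict, keeps refusals and format failures visible as a separate outcome when aggregating verifier accuracy.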