Heimdall: test-time scaling on the generative verification
- URL: http://arxiv.org/abs/2504.10337v2
- Date: Wed, 16 Apr 2025 14:58:26 GMT
- Title: Heimdall: test-time scaling on the generative verification
- Authors: Wenlei Shi, Xing Jin
- Abstract summary: We propose Heimdall, the long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. We also propose Pessimistic Verification to extend Heimdall to scaling up problem solving.
- Score: 2.662648783972914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought reasoning has demonstrated the great potential of LLMs for solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, the long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. Through human evaluation, Heimdall demonstrates impressive generalization, successfully detecting most issues in challenging math proofs, a problem type not included during training. Furthermore, we propose Pessimistic Verification to extend Heimdall to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, following the pessimistic principle, selects the most likely correct solution, i.e., the one with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget, and to 83.3% with a larger budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system: a ternary system in which one component poses questions, another provides solutions, and a third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.
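The abstract describes the pessimistic selection rule only in prose. A minimal sketch of the idea, assuming hypothetical `solve(problem)` and `verify(problem, solution)` callables standing in for the solver model and Heimdall (the paper's exact scoring and budget schedule may differ):

```python
def pessimistic_verification(problem, solve, verify,
                             n_solutions=16, n_verdicts=8):
    """Pessimistic selection sketch: sample candidate solutions, verify each
    one repeatedly, and keep the candidate whose worst verdict is best."""
    candidates = [solve(problem) for _ in range(n_solutions)]
    best_solution, best_score = None, float("-inf")
    for cand in candidates:
        verdicts = [verify(problem, cand) for _ in range(n_verdicts)]
        # Pessimistic principle: trust a solution only as much as its
        # least favorable verdict, penalizing verifier uncertainty.
        score = min(verdicts)
        if score > best_score:
            best_solution, best_score = cand, score
    return best_solution
```

Taking the minimum verdict rather than the mean is what makes the rule pessimistic: a candidate wins only when the verifier is consistently confident in it across repeated samples.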
Related papers
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory [52.44029486173232]
Dynamic Cheatsheet (DC) is a lightweight framework that endows a black-box language model with a persistent, evolving memory.
DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time.
This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback.
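As a rough illustration of that inference-time loop (with hypothetical `model` and `curate` callables; DC's actual curation prompts are more elaborate):

```python
def dynamic_cheatsheet_loop(tasks, model, curate):
    """Test-time learning sketch: a plain-text memory is prepended to every
    prompt and rewritten after each task, so strategies persist across queries."""
    memory = ""  # the persistent, evolving cheatsheet; starts empty
    answers = []
    for task in tasks:
        answer = model(f"Cheatsheet:\n{memory}\n\nTask:\n{task}")
        answers.append(answer)
        # A second pass distills reusable insights back into the cheatsheet;
        # no ground-truth labels are required, matching the summary above.
        memory = curate(memory, task, answer)
    return answers
```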
arXiv Detail & Related papers (2025-04-10T17:57:33Z) - Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities [0.0]
"Numberland" is a 100-problem test to evaluate the numerical reasoning abilities of LLM-based agents.
We evaluated five LLM-based agents: OpenAI's o1 and o1-mini, Google Gemini, Microsoft Copilot, and Anthropic Claude.
We tested the top solver (o1, with 73% accuracy) on 25 harder problems, and its score fell to 27%, confirming search as a bottleneck.
arXiv Detail & Related papers (2025-03-31T21:06:39Z) - FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluating the reasoning capabilities of large language models.
We introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move.
We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
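As a toy illustration of the two tasks (not one of the benchmark's actual puzzles), state checking and state transition for a 3x3 sliding puzzle might look like:

```python
def check_state(state):
    """State checking: is this 3x3 sliding-puzzle state well formed?
    `state` is a tuple of 9 ints, with 0 marking the blank."""
    return sorted(state) == list(range(9))

def next_states(state):
    """State transition: enumerate every state reachable in one move."""
    i = state.index(0)                 # locate the blank
    row, col = divmod(i, 3)
    successors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:  # stay on the board
            j = 3 * r + c
            s = list(state)
            s[i], s[j] = s[j], s[i]    # slide the neighboring tile
            successors.append(tuple(s))
    return successors

start = (1, 2, 3, 4, 0, 5, 6, 7, 8)
print(check_state(start))              # True
print(len(next_states(start)))         # 4: the blank is in the center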
arXiv Detail & Related papers (2025-02-27T16:23:25Z) - Diverse Inference and Verification for Advanced Reasoning [19.88677753421871]
Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems, is simple and effective.
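A minimal sketch of that recipe, assuming a pool of `models` (callables) and a `verifier` for checkable domains; the paper combines far more methods than this:

```python
def rejection_sample(problem, models, verifier, budget=8):
    """Verification-gated rejection sampling over a diverse model pool.
    `models` is a list of callables `model(problem) -> str`; `verifier`
    is a callable `verifier(problem, answer) -> bool`."""
    for i in range(budget):
        model = models[i % len(models)]  # rotate through the diverse pool
        answer = model(problem)
        if verifier(problem, answer):    # keep only answers the checker accepts
            return answer
    return None                          # budget exhausted, no verified answer
```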
arXiv Detail & Related papers (2025-02-14T07:22:25Z) - Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving [0.0]
This study evaluates 10 large language models (LLMs) with 7 to 8 billion parameters using the MATH dataset.
The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions.
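A bare-bones version of the execute-and-observe step such a harness needs (illustrative only, and unsandboxed, so unsafe for genuinely untrusted code):

```python
import contextlib
import io

def run_generated_code(code: str) -> str:
    """Run model-generated Python in a fresh namespace and capture stdout as
    the observation fed back to the model. Unsandboxed: illustration only."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})               # prints from the snippet are captured
    except Exception as exc:
        return f"error: {exc}"
    return buffer.getvalue().strip()

# A snippet a model might emit while working a MATH problem:
print(run_generated_code("print(sum(i * i for i in range(1, 11)))"))  # 385
```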
arXiv Detail & Related papers (2025-01-28T17:11:36Z) - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
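A minimal sketch of that perturbation, with an illustrative mask rate and mask token id (the paper's exact masking scheme may differ):

```python
import random

def mask_reasoning_tokens(token_ids, mask_id, p=0.15, seed=None):
    """Randomly replace chain-of-thought tokens with a mask token before
    computing the training loss. `mask_id` and the rate `p` are illustrative."""
    rng = random.Random(seed)
    return [mask_id if rng.random() < p else t for t in token_ids]

# Token ids here are made up; in practice they come from the tokenizer.
cot = [312, 88, 4021, 17, 905, 2210]
print(mask_reasoning_tokens(cot, mask_id=0, p=0.3, seed=1))
```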
arXiv Detail & Related papers (2024-03-04T16:21:54Z) - Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [73.79091519226026]
Uncertainty of Thoughts (UoT) is an algorithm to augment large language models with the ability to actively seek information by asking effective questions.
In experiments on medical diagnosis, troubleshooting, and the 20 Questions game, UoT achieves an average performance improvement of 38.1% in the rate of successful task completion.
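The core idea, selecting the question with the highest expected information gain over the remaining hypotheses, can be sketched as follows (a greedy one-step version, not UoT's full tree search):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_question(prior, questions):
    """Pick the question whose answer is expected to reduce entropy most.
    `prior` maps hypotheses to probabilities; each question maps every
    hypothesis to the answer it would produce."""
    h0 = entropy(prior.values())
    best, best_gain = None, -1.0
    for q, answer_of in questions.items():
        groups = {}  # split the hypothesis mass by answer
        for h, p in prior.items():
            groups.setdefault(answer_of[h], []).append(p)
        expected = sum(sum(g) * entropy([p / sum(g) for p in g])
                       for g in groups.values())
        gain = h0 - expected
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain

# Toy 20 Questions example: "animal?" splits the space evenly and wins.
prior = {"dog": 0.25, "cat": 0.25, "car": 0.25, "boat": 0.25}
questions = {"animal?": {"dog": "yes", "cat": "yes", "car": "no", "boat": "no"},
             "dog?":    {"dog": "yes", "cat": "no",  "car": "no", "boat": "no"}}
print(best_question(prior, questions))  # ('animal?', 1.0)
```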
arXiv Detail & Related papers (2024-02-05T18:28:44Z) - SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification scheme for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
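In outline, the scheme checks each step conditioned on the preceding ones and combines the per-step verdicts; a sketch with a hypothetical `check_step` LLM call:

```python
def selfcheck(problem, steps, check_step):
    """Combine per-step verdicts into a solution-level confidence.
    `check_step(problem, previous_steps, step) -> float in [0, 1]` is a
    hypothetical stand-in for the zero-shot LLM checking call."""
    scores = [check_step(problem, steps[:i], step)
              for i, step in enumerate(steps)]
    confidence = 1.0
    for s in scores:
        confidence *= s   # one bad step should sink the whole solution
    return confidence, scores
```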
arXiv Detail & Related papers (2023-08-01T10:31:36Z) - Learning To Dive In Branch And Bound [95.13209326119153]
We propose L2Dive to learn specific diving heuristics with graph neural networks.
We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions.
arXiv Detail & Related papers (2023-01-24T12:01:45Z) - Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence and improves stability and accuracy.
Our model's network parameters are reduced to only 37% of the baseline's, and the average gap between our solutions and the expert solutions decreases from 6.8% to 1.3%.
arXiv Detail & Related papers (2022-10-31T09:46:26Z) - Towards Explainable Metaheuristic: Mining Surrogate Fitness Models for Importance of Variables [69.02115180674885]
We use four benchmark problems to train a surrogate model and investigate how the surrogate model learns the search space.
We show that the surrogate model picks out key characteristics of the problem as it is trained on population data from each generation.
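The mining step can be approximated with any off-the-shelf surrogate; a sketch on a made-up weighted-OneMax problem (not one of the paper's four benchmarks):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a surrogate on (solution, fitness) pairs collected from the population,
# then read off per-variable importances. The problem and weights are made up.
rng = np.random.default_rng(0)
population = rng.integers(0, 2, size=(500, 10))        # binary solutions
weights = np.array([5, 1, 1, 1, 1, 1, 1, 1, 1, 5])     # variables 0 and 9 matter most
fitness = population @ weights
surrogate = RandomForestRegressor(random_state=0).fit(population, fitness)
print(np.round(surrogate.feature_importances_, 2))     # variables 0 and 9 stand out
```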
arXiv Detail & Related papers (2022-05-31T09:16:18Z) - Machine learning for complete intersection Calabi-Yau manifolds: a methodological study [0.0]
We revisit the question of predicting the Hodge numbers $h^{1,1}$ and $h^{2,1}$ of complete intersection Calabi-Yau manifolds using machine learning (ML).
We obtain 97% (resp. 99%) accuracy for $h^{1,1}$ on the old dataset using a neural network inspired by the Inception model, with only 30% (resp. 70%) of the data for training.
For the new one, a simple linear regression leads to almost 100% accuracy with 30% of the data for training.
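The linear-regression baseline is straightforward to reproduce in outline; here with placeholder data standing in for the real CICY configuration matrices and their $h^{1,1}$ values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder features/targets; only the 30% train fraction mirrors the summary.
rng = np.random.default_rng(0)
X = rng.random((200, 12))
y = X @ rng.random(12) + 0.05 * rng.standard_normal(200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.3f}")
```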
arXiv Detail & Related papers (2020-07-30T19:43:49Z)