Related papers: Diverse Inference and Verification for Advanced Reasoning

URL: http://arxiv.org/abs/2502.09955v1
Date: Fri, 14 Feb 2025 07:22:25 GMT
Title: Diverse Inference and Verification for Advanced Reasoning
Authors: Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell
Abstract summary: Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems is simple and effective.
Score: 19.88677753421871
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet find challenging advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems is simple and effective. We automatically verify correctness of solutions to IMO problems by Lean, and ARC puzzles by code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
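The abstract's recipe (query several models and methods at test time, keep an answer that passes automatic verification, and fall back to a selection rule when no verifier exists) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function `diverse_inference`, the toy solvers, and the toy verifier are all hypothetical, and the majority-vote fallback is an assumed stand-in for the paper's best-of-N selection, whose details are not given here.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def diverse_inference(
    problem: str,
    solvers: Iterable[Callable[[str], str]],
    verifier: Optional[Callable[[str, str], bool]] = None,
) -> Optional[str]:
    """Query every solver once and pick one answer.

    With a verifier: return the first candidate it accepts, or None if none pass.
    Without a verifier: return the most frequent candidate (assumed fallback).
    """
    candidates = [solve(problem) for solve in solvers]
    if not candidates:
        return None
    if verifier is not None:
        for candidate in candidates:
            if verifier(problem, candidate):
                return candidate
        return None  # every candidate was rejected
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical stand-ins for models: each "solver" returns a fixed guess.
toy_solvers = [lambda p: "5", lambda p: "4", lambda p: "4"]


def toy_verifier(problem: str, answer: str) -> bool:
    """Check an answer to a problem of the form 'a+b' by recomputing the sum."""
    expected = sum(int(term) for term in problem.split("+"))
    return answer == str(expected)


verified = diverse_inference("2+2", toy_solvers, toy_verifier)
voted = diverse_inference("2+2", toy_solvers)
```

In the paper's setting the verifier role is played by Lean for IMO problems and by executing code for ARC puzzles; the toy arithmetic check above only marks where such a checker would plug in.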

Related papers

Self-Questioning Language Models [51.75087358141567]
We propose an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver. Both the proposer and solver are trained via reinforcement learning. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces.
arXiv Detail & Related papers (2025-08-05T17:51:33Z)
AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming [2.3020018305241337]
We introduce AlgoSimBench, a new benchmark designed to assess the ability to identify algorithmically similar problems (ASPs). AlgoSimBench consists of 1317 problems, annotated with distinct fine-grained algorithm tags, from which we draw 402 multiple-choice questions (MCQs). Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. We propose attempted solution matching (ASM), a novel method for improving problem similarity detection.
arXiv Detail & Related papers (2025-07-21T08:34:20Z)
Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles [46.71887319140096]
We introduce Enigmata, the first comprehensive suite tailored for improving Large Language Models with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks.
arXiv Detail & Related papers (2025-05-26T12:40:31Z)
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning [66.43194385702297]
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). We propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks.
arXiv Detail & Related papers (2025-04-15T21:37:13Z)
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning [95.31714779585272]
DeepMath-103K is a new, large-scale dataset comprising approximately 103K mathematical problems. Each problem includes a verifiable final answer, enabling rule-based RL. We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks.
arXiv Detail & Related papers (2025-04-15T17:59:51Z)
Heimdall: test-time scaling on the generative verification [2.662648783972914]
We propose Heimdall, the long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. We also propose Pessimistic Verification to extend the functionality of Heimdall to scaling up the problem solving.
arXiv Detail & Related papers (2025-04-14T15:46:33Z)
Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents [27.112239616508834]
Mixture of Refinement Agents (MoRA) is a novel agentic refinement framework for large language models (LLMs). MoRA iteratively refines the LLM-generated base solution by correcting its errors, resulting in a significant performance improvement for open-source LLMs. We evaluate our approach on the SciEval and MMLU subsets along with our own physics dataset (PhysicsQA).
arXiv Detail & Related papers (2024-12-01T14:15:55Z)
Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
Learning Task Decomposition to Assist Humans in Competitive Programming [90.4846613669734]
We introduce a novel objective for learning task decomposition, termed assistive value (AssistV). We collect a dataset of human repair experiences on different decomposed solutions. Under 177 hours of human study, our method enables non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
arXiv Detail & Related papers (2024-06-07T03:27:51Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad. We construct and test a range of LM inference methods for competitive programming for the first time. We find GPT-4 only achieves an 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z)
DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs). We first categorize questions into two subsets based on confidence score ($\mathcal{CS}$), which is estimated by statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
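A frequency-based confidence score like the one in the DCR entry above can be illustrated as the share of sampled answers that agree with the most common one. The sketch below is an assumption-laden reading of that one-sentence summary, not the DCR authors' code: the function names, the 0.8 threshold, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confidence_score(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples that gave it."""
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)


def split_by_confidence(
    sampled: Dict[str, List[str]], threshold: float = 0.8
) -> Tuple[List[str], List[str]]:
    """Split question ids into (high-confidence, low-confidence) subsets.

    The threshold is an illustrative value, not one taken from the paper.
    """
    high: List[str] = []
    low: List[str] = []
    for question, answers in sampled.items():
        _, score = confidence_score(answers)
        (high if score >= threshold else low).append(question)
    return high, low


# Hypothetical sampled answers: five generations per multiple-choice question.
samples = {
    "q1": ["A", "A", "A", "A", "A"],
    "q2": ["A", "B", "C", "A", "B"],
}
high_conf, low_conf = split_by_confidence(samples)
```

Here `q1` lands in the high-confidence subset because all samples agree, while `q2` falls into the low-confidence subset that would receive further treatment.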

This list is automatically generated from the titles and abstracts of the papers on this site.