L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution
- URL: http://arxiv.org/abs/2503.22832v2
- Date: Thu, 10 Apr 2025 18:45:37 GMT
- Title: L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution
- Authors: Simeng Sun, Cheng-Ping Hsieh, Faisal Ladhak, Erik Arakelyan, Santiago Akle Serano, Boris Ginsburg,
- Abstract summary: Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps.<n>We introduce L0-Bench, a language model benchmark for testing procedural correctness.<n>L0-Bench grades models on their ability to generate step-by-step, error-free execution traces.
- Score: 31.19899557805607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.
Related papers
- How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs [49.61011897610774]
How2Everything is a framework to evaluate and improve goal-conditioned procedure generation.<n>Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics.<n>How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
arXiv Detail & Related papers (2026-02-09T15:47:14Z) - seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs [1.0519693622157462]
We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs)<n>We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
arXiv Detail & Related papers (2025-09-21T01:32:13Z) - StepWiser: Stepwise Generative Judges for Wiser Reasoning [52.32416311990343]
Process reward models address this by providing step-by-step feedback.<n>Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself.<n>We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
arXiv Detail & Related papers (2025-08-26T17:45:05Z) - STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning [6.282781900938977]
We present STEPWISE-CODEX-Bench (SX-Bench), a novel benchmark for complex multi-function understanding and fine-grained execution reasoning.<n>SX-Bench is highly discriminative, even the state-of-the-art OpenAI-O3 achieves only 78.37 percent accuracy on Hard-Reasoning tasks.
arXiv Detail & Related papers (2025-08-07T09:28:43Z) - Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [57.16286134405821]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time.<n>Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor.<n> Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z) - PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics [5.645098175233682]
We contribute a novel type of benchmark evaluating inductive reasoning capabilities of Large Language Models (LLMs)<n>We present a fully automated pipeline that generates problems with controllable difficulty, enabling evaluation of reasoning models.<n>Experiments reveal a substantial performance gap between models that leverage test-time compute or LCoT (long chain-of-thought) reasoning and those that do not.
arXiv Detail & Related papers (2025-05-29T05:51:16Z) - LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling [29.721108461390973]
We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step.<n>PIR identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components.<n>Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets.
arXiv Detail & Related papers (2025-05-25T15:17:57Z) - Self-Steering Language Models [113.96916935955842]
DisCIPL is a method for "self-steering" language models.
DisCIPL uses a Planner model to generate a task-specific inference program.
Our work opens up a design space of highly-parallelized Monte Carlo inference strategies.
arXiv Detail & Related papers (2025-04-09T17:54:22Z) - Adaptive Rectification Sampling for Test-Time Compute Scaling [5.085583751997239]
We propose Adaptive Rectification Sampling (AR-Sampling) to guide large language models to self-correction.
Our approach enables the models to rethink in more fine-grained level, improving the accuracy of solutions.
arXiv Detail & Related papers (2025-04-02T02:57:52Z) - Self-Training Elicits Concise Reasoning in Large Language Models [23.475414693530965]
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens.
We propose simple fine-tuning methods which leverage self-generated concise reasoning paths.
Our method achieves a 30% reduction in output tokens, across five model families on GSM8K and MATH, while maintaining average accuracy.
arXiv Detail & Related papers (2025-02-27T14:14:50Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.<n>However, improvement is plateauing due to the exhaustion of readily available high-quality data.<n>We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Examining False Positives under Inference Scaling for Mathematical Reasoning [59.19191774050967]
This paper systematically examines the prevalence of false positive solutions in mathematical problem solving for language models.<n>We explore how false positives influence the inference time scaling behavior of language models.
arXiv Detail & Related papers (2025-02-10T07:49:35Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models [8.22619177301814]
We introduce TestBench, a benchmark for class-level LLM-based test case generation.
We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub.
We propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate.
arXiv Detail & Related papers (2024-09-26T06:18:06Z) - Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws.<n>We demonstrate here a dramatic breakdown of generalization and basic reasoning of all SOTA models claiming strong function.<n>We also observe strong overconfidence in the wrong solutions, expressed in form of plausible sounding explanation-like confabulations.
arXiv Detail & Related papers (2024-06-04T07:43:33Z) - AlphaMath Almost Zero: Process Supervision without Process [6.318873143509028]
We propose an innovative framework, AlphaMath, that bypasses the need for process annotations by leveraging Monte Carlo Tree Search (MCTS)
This framework focuses on unleashing the potential of a well-pretrained LLM to autonomously enhance its mathematical reasoning.
The experimental results on both in-domain and out-of-domain datasets demonstrate that even without GPT-4 or human-annotated process supervision, our AlphaMath framework achieves comparable or superior results to previous state-of-the-art methods.
arXiv Detail & Related papers (2024-05-06T15:20:30Z) - General Purpose Verification for Chain of Thought Prompting [16.381123651223763]
We explore ways to improve reasoning capabilities of Large Language Models (LLMs)
We propose three general principles that a model should adhere to while reasoning.
We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation.
arXiv Detail & Related papers (2024-04-30T21:15:17Z) - Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning [75.74103236299477]
Chain-of-thought prompting(CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach that can deliberate the reasoning steps with tool interfaces, namely textbfDELI.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.