Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings
- URL: http://arxiv.org/abs/2512.14917v1
- Date: Tue, 16 Dec 2025 21:12:53 GMT
- Title: Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings
- Authors: Changshu Liu, Alireza Ghazanfari, Yang Chen, Reyhaneh Jabbarvand
- Abstract summary: RE2-Bench is a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks using RE2-Bench reveals a significant performance drop from Easy to Hard problems.
- Score: 5.30570508258782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Existing benchmarks involve simple programs, failing to represent real-world complexities such as inter- or intra-procedural dependencies, core or third-party API calls, highly nested constructs, and non-primitive complex types. Evaluating LLMs under such a simplistic setting poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, this paper proposes RE2-Bench, a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. RE2-Bench leverages static and dynamic program analysis to automatically serialize and deserialize compound, complex, and custom types in real-world code, going far beyond the primitive-only settings used in prior work. A key feature of RE2-Bench is categorizing each reasoning problem as Easy or Hard via a principled majority-vote mechanism over nine interpretable code complexity metrics, resulting in two well-separated and semantically meaningful difficulty categories suitable for precise calibration of LLM reasoning ability. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks -- input prediction and output prediction -- using RE2-Bench reveals a significant performance drop from Easy to Hard problems (51.50% for input prediction and 42.15% for output prediction), confirming that prior evaluations substantially overestimate the reasoning capabilities of LLMs.
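The majority-vote difficulty labeling described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the RE2-Bench implementation: the nine metric names and all threshold values below are hypothetical placeholders, since the paper's concrete metrics are not listed in this summary.

```python
# Sketch: label a reasoning problem Easy or Hard by majority vote over
# per-metric votes. Metric names and thresholds are hypothetical.

def difficulty_label(metrics: dict, thresholds: dict) -> str:
    """Each complexity metric casts one vote: 'Hard' if the problem's
    value exceeds that metric's threshold, else 'Easy'. Majority wins."""
    hard_votes = sum(1 for name, value in metrics.items()
                     if value > thresholds[name])
    return "Hard" if hard_votes > len(metrics) / 2 else "Easy"

# Nine hypothetical complexity metrics measured for one problem.
problem = {
    "cyclomatic_complexity": 14, "nesting_depth": 5, "loc": 180,
    "num_calls": 9, "num_branches": 12, "num_loops": 4,
    "num_params": 6, "num_returns": 3, "halstead_volume": 950,
}
cutoffs = {
    "cyclomatic_complexity": 10, "nesting_depth": 3, "loc": 100,
    "num_calls": 5, "num_branches": 8, "num_loops": 2,
    "num_params": 4, "num_returns": 2, "halstead_volume": 600,
}
print(difficulty_label(problem, cutoffs))  # all nine metrics vote "Hard"
```

An odd number of voters (nine) guarantees no ties, which is presumably why a majority vote yields the "well-separated" categories the abstract mentions.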
Related papers
- VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing [49.12477222994131]
We propose a Vision Language ReaSoning Benchmark (VLRS-Bench) for complex remote sensing reasoning. VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases.
arXiv Detail & Related papers (2026-02-04T08:21:33Z)
- OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling [13.57588221678224]
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, yet the boundaries of their capabilities in automated formulation and problem solving remain poorly understood. We propose OPT-ENGINE, a benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels.
arXiv Detail & Related papers (2026-01-09T09:22:33Z)
- seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs [1.0519693622157462]
We introduce seqBench, a benchmark for probing sequential reasoning limits in Large Language Models (LLMs). We find that even top-performing models systematically fail on seqBench's structured reasoning tasks despite minimal search complexity.
arXiv Detail & Related papers (2025-09-21T01:32:13Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Computational Thinking Reasoning in Large Language Models [69.28428524878885]
The Computational Thinking Model (CTM) is a novel framework that incorporates computational thinking paradigms into large language models (LLMs). Live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM outperforms conventional reasoning models and tool-augmented baselines in accuracy, interpretability, and generalizability.
arXiv Detail & Related papers (2025-06-03T09:11:15Z)
- Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
- Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
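The kind of multi-step execution task such a Turing-machine-based benchmark poses can be sketched as follows. The toy machine below (its states, symbols, and transition rules) is invented for illustration and is not drawn from TMBench itself.

```python
# Sketch: simulate a Turing machine step by step from a transition table.
# Rules map (state, symbol) -> (new_state, symbol_to_write, head_move).

def run_tm(tape, rules, state="q0", head=0, max_steps=100):
    """Execute transitions until no rule applies (halt) or the step
    budget runs out; return the final state and tape contents."""
    cells = dict(enumerate(tape))  # sparse tape; blank cells read as '_'
    for _ in range(max_steps):
        key = (state, cells.get(head, "_"))
        if key not in rules:       # no matching rule: the machine halts
            break
        state, write, move = rules[key]
        cells[head] = write
        head += 1 if move == "R" else -1
    return state, "".join(cells[i] for i in sorted(cells))

# A toy machine that flips every bit, then halts on the first blank.
flip = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
}
print(run_tm("1011", flip))  # ('q0', '0100')
```

Because each step is fully determined by the transition table, such tasks isolate step-by-step execution fidelity from world knowledge, matching the "self-contained and knowledge-agnostic" design the summary describes.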
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
- S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models [13.083179473480705]
Large Reasoning Models (LRMs) have achieved breakthroughs in complex reasoning tasks through explicit chains of thought. However, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. S1-Bench introduces a suite of simple, diverse, and natural questions to assess LRMs' performance on questions better suited to system 1 thinking.
arXiv Detail & Related papers (2025-04-14T16:13:23Z)
- Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [21.12989936864145]
Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs). We propose Reasoning-as-Logic-Units (RaLU), which constructs a more reliable reasoning path by aligning logical units between the generated program and their corresponding natural-language descriptions.
arXiv Detail & Related papers (2025-02-05T08:23:18Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited to system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks.
Recent efforts have demonstrated that LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.