ProBench: Benchmarking Large Language Models in Competitive Programming
- URL: http://arxiv.org/abs/2502.20868v1
- Date: Fri, 28 Feb 2025 09:12:42 GMT
- Title: ProBench: Benchmarking Large Language Models in Competitive Programming
- Authors: Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong
- Abstract summary: We propose ProBench to benchmark large language models (LLMs) in competitive programming. ProBench collects a comprehensive set of competitive programming problems from the Codeforces, Luogu, and Nowcoder platforms. We assess 9 of the latest LLMs in competitive programming across multiple dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation.
- Score: 44.09445715541973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for code evaluation are increasingly inadequate for assessing the capability of advanced LLMs in code reasoning. To bridge this gap in high-level code reasoning assessment, we propose ProBench to benchmark LLMs in competitive programming, drawing inspiration from the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems from the Codeforces, Luogu, and Nowcoder platforms, covering July to December 2024, and obtains real test results through online submissions to ensure the fairness and accuracy of the evaluation. We establish a unified problem attribute system, including difficulty grading and algorithm tagging. With the carefully collected and annotated data in ProBench, we systematically assess 9 of the latest LLMs in competitive programming across multiple dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38, suggesting that models trained on specialized reasoning tasks significantly outperform general-purpose models (even those larger than the reasoning-oriented models) in programming. Further analysis also reveals key areas for programming capability enhancement, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.
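The pipeline described in the abstract (unified difficulty grading and algorithm tagging, plus scoring from real online-judge verdicts) can be pictured with a small sketch. Everything below, including the `Problem`/`Submission` record layout and the `solve_rate` helper, is an illustrative assumption rather than ProBench's actual schema or metric, and the example verdicts are made up:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Problem:
    """One competitive-programming problem with unified attributes (illustrative schema)."""
    problem_id: str
    platform: str                                   # "Codeforces", "Luogu", or "Nowcoder"
    difficulty: int                                 # unified difficulty grade, e.g. 1 (easy) .. 5 (hard)
    tags: list[str] = field(default_factory=list)   # algorithm tags, e.g. ["dp", "graphs"]


@dataclass
class Submission:
    """Verdict returned by the online judge for one model-generated solution."""
    model: str
    problem_id: str
    verdict: str                                    # e.g. "Accepted", "Wrong Answer", "Time Limit Exceeded"


def solve_rate(submissions: list[Submission]) -> dict[str, float]:
    """Fraction of attempted problems each model solved, judged by real online verdicts."""
    attempted = defaultdict(set)
    solved = defaultdict(set)
    for s in submissions:
        attempted[s.model].add(s.problem_id)
        if s.verdict == "Accepted":
            solved[s.model].add(s.problem_id)
    return {m: len(solved[m]) / len(attempted[m]) for m in attempted}


if __name__ == "__main__":
    subs = [
        Submission("model-A", "CF-2024-0001", "Accepted"),
        Submission("model-A", "LG-2024-0002", "Wrong Answer"),
        Submission("model-B", "CF-2024-0001", "Time Limit Exceeded"),
        Submission("model-B", "LG-2024-0002", "Accepted"),
    ]
    print(solve_rate(subs))  # {'model-A': 0.5, 'model-B': 0.5}
```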
Related papers
- AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models [86.83875864328984]
We propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities.
arXiv Detail & Related papers (2025-02-24T07:02:31Z)
- MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [17.432401371613903]
We propose a resource-efficient, System-2 thinking framework for code correctness evaluation.
MCTS-Judge uses Monte Carlo Tree Search to decompose problems into simpler, multi-perspective evaluations.
A high-precision, unit-test-level reward mechanism encourages the Large Language Model to perform line-by-line analysis.
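As a rough illustration of a unit-test-level reward signal, the helper below scores a candidate function by the fraction of unit tests it passes. This is a generic sketch, not MCTS-Judge's actual reward mechanism:

```python
def unit_test_reward(candidate, tests) -> float:
    """Reward in [0, 1]: fraction of (args, expected) unit tests the candidate passes."""
    if not tests:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case earns no reward
    return passed / len(tests)


# A buggy max() that always returns its first argument passes 2 of 3 tests.
buggy_max = lambda a, b: a
print(unit_test_reward(buggy_max, [((3, 1), 3), ((2, 2), 2), ((1, 5), 5)]))  # 0.666...
```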
arXiv Detail & Related papers (2025-02-18T02:55:48Z)
- Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation [24.081573908824353]
First-order logic (FOL) reasoning is pivotal for intelligent systems.
Existing benchmarks often rely on extensive human annotation or handcrafted templates.
We propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models with the rigor and precision of symbolic provers.
arXiv Detail & Related papers (2025-02-10T15:31:54Z)
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
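For background, a rating comparable to a human Elo is usually maintained with the textbook update rule below; CodeElo's exact rating procedure is not described here, so treat this only as a reference formula:

```python
def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Standard Elo update: score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected = 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))
    return rating + k * (score - expected)


# A 1500-rated model that beats a 1600-rated opponent gains about 20 points.
print(round(elo_update(1500, 1600, 1.0), 1))  # 1520.5
```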
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs in incorrect code, comprising three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
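A training-free critique-and-repair loop of the general shape described above can be sketched as follows. The `generate` callable stands in for any LLM wrapper, and the `solution` entry-point name is a hypothetical convention for this sketch, not the paper's interface:

```python
def critique_and_repair(task: str, tests, generate, max_rounds: int = 3):
    """Generate code, execute it against unit tests, and feed failures back as critique."""
    feedback = ""
    for _ in range(max_rounds):
        source = generate(task + feedback)          # LLM call (wrapper supplied by the caller)
        namespace = {}
        try:
            exec(source, namespace)                 # define the candidate (sandboxing omitted)
            fn = namespace["solution"]              # assumed entry-point name
            if all(fn(*args) == expected for args, expected in tests):
                return source                       # all unit tests pass
            feedback = "\n# Critique: some unit tests failed; revise the logic."
        except Exception as err:
            feedback = f"\n# Critique: execution error: {err}"
    return None                                     # no passing candidate within the budget
```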
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code [34.03774442237902]
The application of Large Language Models to code-related tasks has emerged as a prominent field.
Existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities.
We propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code.
arXiv Detail & Related papers (2024-03-12T17:58:04Z)
- Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all models on perturbed questions.
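One concrete instance of such a perturbation, swapping the numbers in an arithmetic question and recomputing the gold answer, might look like the sketch below; the function name and interface are assumptions for illustration, and the paper's ontology covers many more perturbation types:

```python
import random
import re


def perturb_numbers(question: str, answer_fn, rng=random.Random(0)):
    """Replace every number in an arithmetic word problem and recompute the gold answer.

    answer_fn maps the list of new numbers to the correct answer for this problem.
    """
    new_nums = [rng.randint(2, 99) for _ in re.findall(r"\d+", question)]
    replacements = iter(new_nums)
    perturbed = re.sub(r"\d+", lambda _: str(next(replacements)), question)
    return perturbed, answer_fn(new_nums)


q = "Alice has 3 apples and buys 5 more. How many apples does she have now?"
print(perturb_numbers(q, lambda nums: nums[0] + nums[1]))
```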
arXiv Detail & Related papers (2024-01-17T18:13:07Z)