OJBench: A Competition Level Code Benchmark For Large Language Models
- URL: http://arxiv.org/abs/2506.16395v1
- Date: Thu, 19 Jun 2025 15:27:02 GMT
- Title: OJBench: A Competition Level Code Benchmark For Large Language Models
- Authors: Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu
- Abstract summary: OJBench is a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of large language models (LLMs). We conduct a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems.
- Score: 23.061564017225734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.
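As a point of reference for how a competition-level benchmark such as OJBench typically scores a submission, the sketch below compiles a C++ solution, runs it against hidden test cases under a time limit, and counts the problem as solved only if every case passes; pass@1 is then the fraction of problems solved with a single sampled solution per problem. This is a minimal illustration under assumed conventions (the function names, g++ flags, token-wise output comparison, and the absence of special judges and memory limits are all assumptions), not the evaluation harness released with OJBench.

```python
import subprocess
import tempfile
from pathlib import Path

def judge_cpp_submission(source_code: str,
                         test_cases: list[tuple[str, str]],
                         time_limit_s: float = 2.0) -> bool:
    """Return True only if the program compiles and passes every hidden test case."""
    with tempfile.TemporaryDirectory() as work:
        src = Path(work) / "main.cpp"
        binary = Path(work) / "main"
        src.write_text(source_code)

        # Compile with flags typical of online judges (assumed, not OJBench's exact setup).
        compiled = subprocess.run(
            ["g++", "-O2", "-std=c++17", "-o", str(binary), str(src)],
            capture_output=True,
        )
        if compiled.returncode != 0:
            return False  # compilation error: problem counts as unsolved

        for stdin_data, expected_stdout in test_cases:
            try:
                run = subprocess.run(
                    [str(binary)], input=stdin_data, capture_output=True,
                    text=True, timeout=time_limit_s,
                )
            except subprocess.TimeoutExpired:
                return False  # time-limit exceeded
            if run.returncode != 0:
                return False  # runtime error
            # Token-wise comparison; problems needing a special judge are not handled here.
            if run.stdout.split() != expected_stdout.split():
                return False  # wrong answer
        return True

def pass_at_1(results: list[bool]) -> float:
    """pass@1 over a problem set: fraction of problems solved by one sampled solution each."""
    return sum(results) / len(results) if results else 0.0
```

In use, one would call judge_cpp_submission once per problem with a single model-generated solution and feed the resulting booleans to pass_at_1.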
Related papers
- LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) reportedly now outperform elite humans in competitive programming. We revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
- Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark [0.0]
The Frame Problem and the Symbol Grounding Problem have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems.
arXiv Detail & Related papers (2025-06-09T16:12:47Z)
- ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests [85.72404266850982]
We propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world. Results underscore the significant challenge of evaluating complex reasoning abilities.
arXiv Detail & Related papers (2025-06-05T11:20:37Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- ProBench: Benchmarking Large Language Models in Competitive Programming [44.09445715541973]
We propose ProBench to benchmark large language models (LLMs) in competitive programming. ProBench collects a comprehensive set of competitive programming problems from the Codeforces, Luogu, and Nowcoder platforms. We assess 9 of the latest LLMs in competitive programming across multiple dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation.
arXiv Detail & Related papers (2025-02-28T09:12:42Z)
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time (a generic Elo-update sketch follows this list).
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code [34.03774442237902]
The application of large language models (LLMs) to code-related tasks has emerged as a prominent field.
Existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities.
We propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code.
arXiv Detail & Related papers (2024-03-12T17:58:04Z)
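One related entry above, CodeElo, ranks models with human-comparable Elo ratings. As a generic illustration of what an Elo update involves (the standard expected-score and update formulas with an assumed K-factor, not the specific rating procedure used by CodeElo or Codeforces), here is a minimal sketch:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats a player rated r_b under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """New rating for player A after one game; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

    The K-factor of 32 is an assumed default; real rating systems tune it.
    """
    return r_a + k * (score_a - elo_expected_score(r_a, r_b))

# Example: a 1500-rated model defeats a 1600-rated human opponent.
print(round(elo_update(1500.0, 1600.0, 1.0), 1))  # ~1520.5
```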
This list is automatically generated from the titles and abstracts of the papers on this site.