AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
- URL: http://arxiv.org/abs/2508.16402v1
- Date: Fri, 22 Aug 2025 14:04:55 GMT
- Title: AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
- Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
- Abstract summary: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). We argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. We present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC.
- Score: 37.21656149034477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
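The abstract describes expert-validated test suites built from a mix of automated generation and human curation, but this listing gives no implementation details. As a rough orientation only, the sketch below shows how an ICPC/IOI-style test-suite-driven judge typically runs a compiled candidate solution against input/answer file pairs with a per-test time limit; the file layout, binary name, and time limit are assumptions, not the AetherCode harness.

```python
# Minimal sketch of a test-suite-driven judging loop (hypothetical layout,
# not the AetherCode implementation): a submission is Accepted only if it
# matches the expected answer on every test case within the time limit.
import subprocess
from pathlib import Path

TIME_LIMIT_S = 2.0  # assumed per-test time limit


def tokens(text: str) -> list[str]:
    """Token-wise comparison, ignoring whitespace differences as most judges do."""
    return text.split()


def run_case(binary: str, case_in: Path, case_ans: Path) -> bool:
    with case_in.open("rb") as stdin:
        try:
            result = subprocess.run(
                [binary], stdin=stdin, capture_output=True, timeout=TIME_LIMIT_S
            )
        except subprocess.TimeoutExpired:
            return False  # Time Limit Exceeded
    if result.returncode != 0:
        return False  # Runtime Error (non-zero exit code)
    return tokens(result.stdout.decode()) == tokens(case_ans.read_text())


def judge(binary: str, test_dir: Path) -> str:
    for case_in in sorted(test_dir.glob("*.in")):
        if not run_case(binary, case_in, case_in.with_suffix(".ans")):
            return f"Rejected on {case_in.name}"
    return "Accepted"


if __name__ == "__main__":
    print(judge("./solution", Path("tests/problem_a")))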
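```

Note that many olympiad problems accept multiple valid outputs and therefore need a special judge (checker) rather than exact answer comparison, which is presumably one reason the paper emphasizes expert validation of test suites.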
Related papers
- LLMs in Code Vulnerability Analysis: A Proof of Concept [0.3441021278275805]
Traditional software security analysis methods struggle to keep pace with the scale and complexity of modern software. This paper explores the incorporation of code-specific and general-purpose Large Language Models to automate critical software security tasks.
arXiv Detail & Related papers (2026-01-13T16:16:11Z)
- CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models [10.539924362853233]
We introduce CFCEval, a framework for evaluating the quality and security of code generated by Large Language Models (LLMs). CFCEval mitigates dataset bias by creating a new benchmark, MLVBench, and incorporates ELRM, a new metric designed to assess the relevance between reference code and generated code. Our experiments show that CFCEval not only captures both quality and security aspects of generated code more effectively but also that its ELRM aligns more closely with human judgments than CodeBLEU.
arXiv Detail & Related papers (2025-12-06T02:20:31Z)
- How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework [8.76693832650115]
Overestimation in evaluating large language models (LLMs) has become an increasing concern. We propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography.
arXiv Detail & Related papers (2025-07-25T12:39:03Z)
- Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition? [53.863591321231276]
Humanity's Last Code Exam (HLCE) comprises the 235 most challenging problems from the International Collegiate Programming Contest (ICPC) World Finals and the International Olympiad in Informatics (IOI). As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. We observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively.
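For reference, pass@1 rates like those cited above are usually computed with the standard unbiased pass@k estimator from the code-generation evaluation literature; the sketch below shows that estimator in general form (the sample counts are hypothetical, and this is not a claim about HLCE's exact evaluation code).

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k), averaged
# over problems, where n = samples drawn per problem and c = samples that pass.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset of samples contains at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples per problem, per-problem pass counts.
pass_counts = [0, 2, 0, 1, 5]
scores = [pass_at_k(10, c, k=1) for c in pass_counts]
print(sum(scores) / len(scores))  # mean pass@1 across problems
```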
arXiv Detail & Related papers (2025-06-15T04:03:31Z)
- LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) are now claimed to outperform elite humans in competitive programming. We revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
- ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests [85.72404266850982]
We propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world. Results underscore the significant challenge of evaluating complex reasoning abilities.
arXiv Detail & Related papers (2025-06-05T11:20:37Z)
- How to Get Your LLM to Generate Challenging Problems for Evaluation [33.625052642068624]
CHASE is a unified framework to synthetically generate challenging problems using Large Language Models. We implement CHASE to create evaluation benchmarks across three diverse domains. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy.
arXiv Detail & Related papers (2025-02-20T16:09:55Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
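CodeElo derives its human-comparable ratings from competition standings; its exact rating procedure is described in that paper. Purely for orientation, the sketch below shows the textbook two-player Elo expected-score and update rule that such rating systems build on (the K-factor and ratings are illustrative, not CodeElo's calculation).

```python
# Classic Elo: expected score against a rated opponent, then a K-factor update.
def expected_score(rating: float, opponent: float) -> float:
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    return rating + k * (score - expected_score(rating, opponent))


# A 1500-rated player who beats a 1700-rated opponent gains roughly 24 points.
print(round(update(1500.0, 1700.0, score=1.0), 1))  # ~1524.3
```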
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
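As one example of the kind of white-box UQ baseline a suite like LM-Polygraph typically includes (this is an illustration, not LM-Polygraph's interface or exact implementation), sequence-level confidence can be scored from per-token log-probabilities:

```python
# Common white-box UQ baseline: score a generated sequence by its mean token
# log-probability (higher = more confident); perplexity is the equivalent
# inverted view (lower = more confident). Per-token log-probs are hypothetical.
import math


def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Length-normalized sequence log-probability."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)


def perplexity(token_logprobs: list[float]) -> float:
    return math.exp(-mean_logprob_confidence(token_logprobs))


logps = [-0.1, -0.3, -2.2, -0.05]  # example per-token log-probs from a model
print(mean_logprob_confidence(logps), perplexity(logps))
```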