Fugu-MT Paper Translation (Abstract): AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Paper abstract: AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

arxiv url: http://arxiv.org/abs/2508.16402v1
Date: Fri, 22 Aug 2025 14:04:55 GMT
Status: Translation complete
Last updated in system: 2025-08-25 16:42:36.403885
Title: AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
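Abstract summary: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). We argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. We present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC.
Reference score (independently computed attention score): 37.21656149034477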
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.

Related papers