Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming
- URL: http://arxiv.org/abs/2601.11332v1
- Date: Fri, 16 Jan 2026 14:29:54 GMT
- Title: Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming
- Authors: Sama Hadhoud, Alaa Elsetohy, Frederikus Hudi, Jan Christian Blaise Cruz, Steven Halim, Alham Fikri Aji
- Abstract summary: We argue that competitive programming is fundamentally a problem-solving task. We propose centering natural-language editorials in both solution generation and evaluation.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.
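The two-stage protocol the abstract describes can be pictured as a pair of prompts: first elicit a natural-language editorial, then condition code generation on it. A minimal sketch, where `llm_complete`, the prompt wording, and the function names are placeholders of ours rather than the paper's exact implementation:

```python
# Sketch of an "idea first, code later" pipeline: the model first writes a
# natural-language editorial, then implements code conditioned on it.
# `llm_complete` is a hypothetical stand-in for any chat-completion API.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def solve_editorial_first(problem_statement: str) -> tuple[str, str]:
    # Stage 1 (problem solving): produce a natural-language editorial --
    # key insight, algorithm, complexity -- with no code allowed.
    editorial = llm_complete(
        "Write an editorial for this competitive programming problem. "
        "State the key insight, the algorithm, and its complexity. "
        "Do not write any code.\n\n" + problem_statement
    )
    # Stage 2 (implementation): translate the editorial into a program.
    code = llm_complete(
        "Implement the following editorial as a complete program.\n\n"
        f"Problem:\n{problem_statement}\n\nEditorial:\n{editorial}"
    )
    return editorial, code
```

Substituting a gold, expert-written editorial for the generated one in stage 2 isolates implementation ability from problem-solving ability, which is precisely the gap the paper measures.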
Related papers
- CodeClash: Benchmarking Goal-Oriented Software Engineering [63.65464283837602]
We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
arXiv Detail & Related papers (2025-11-02T07:42:51Z)
- AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions [37.21656149034477]
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). We argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. We present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC.
arXiv Detail & Related papers (2025-08-22T14:04:55Z)
- Evaluating and Improving Large Language Models for Competitive Program Generation [18.564450345359468]
This study aims to evaluate and improve large language models (LLMs) in solving real-world competitive programming problems. We collect 117 problems from nine regional ICPC/CCPC contests held in 2024 and design four filtering criteria to construct a curated benchmark consisting of 80 problems. We evaluate LLMs' competitive program generation capabilities through online judge (OJ) platforms, guided by a carefully designed basic prompt.
arXiv Detail & Related papers (2025-06-28T17:18:23Z)
- LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) now outperform elite humans in competitive programming. We revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
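CodeElo's human-comparable ratings rest on the standard Elo update rule; a minimal sketch, assuming the conventional 400-point logistic scale and K-factor of 32 from chess rather than CodeElo's exact parameters:

```python
# Standard Elo update: expected score from the rating gap, then a
# K-weighted correction toward the actual result (1 = win, 0.5 = draw, 0 = loss).

def elo_expected(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))

# Example: a 1500-rated model beats a 1600-rated opponent.
print(round(elo_update(1500, 1600, 1.0), 1))  # 1520.5 -- a gain of about 20 points
```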
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
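The collaborative CoT/PoT verification idea can be pictured as cross-checking a natural-language answer against an executable one; a toy sketch, where `llm_complete`, the prompt formats, and the agreement rule are our illustrative assumptions rather than the paper's method:

```python
# Toy collaborative verification: accept an answer only when the
# Chain-of-Thought (natural-language) and Program-of-Thought
# (executed-code) solutions agree.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def verify_by_agreement(question: str) -> str | None:
    cot = llm_complete(
        "Reason step by step and end with 'Answer: <value>'.\n" + question)
    pot = llm_complete(
        "Write Python defining solve() that returns the answer.\n" + question)
    scope: dict = {}
    exec(pot, scope)  # NOTE: sandbox untrusted generated code in practice
    pot_answer = str(scope["solve"]())
    cot_answer = cot.rsplit("Answer:", 1)[-1].strip()
    # Agreement across the two reasoning styles is the verification signal.
    return cot_answer if cot_answer == pot_answer else None
```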
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
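The training-free self-critique loop reads as generate, test, critique, regenerate; a sketch under assumptions of ours (the prompts, the `llm_complete` placeholder, and the single-sample test harness are illustrative, not the study's exact setup):

```python
# Illustrative self-critique repair loop: run generated code on a visible
# test and, on failure, feed the observed output back for a revision.
import os
import subprocess
import tempfile

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def run_test(code: str, stdin_data: str, expected: str) -> tuple[bool, str]:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], input=stdin_data,
                              capture_output=True, text=True, timeout=5)
        return proc.stdout.strip() == expected.strip(), proc.stderr or proc.stdout
    finally:
        os.unlink(path)

def self_critique_loop(problem: str, stdin_data: str, expected: str,
                       max_rounds: int = 3) -> str:
    code = llm_complete("Solve in Python (read stdin, print the answer):\n" + problem)
    for _ in range(max_rounds):
        ok, feedback = run_test(code, stdin_data, expected)
        if ok:
            break
        # Ask the model to critique its own code before rewriting it.
        code = llm_complete(
            f"Problem:\n{problem}\n\nYour code:\n{code}\n\n"
            f"Observed output on a sample test:\n{feedback}\n"
            "Critique the code, then output a corrected full program.")
    return code
```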
- Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs [2.3020018305241337]
Distilling explicit chain-of-thought reasoning paths has emerged as an effective method for improving the reasoning abilities of large language models.
We propose a novel approach to distill reasoning abilities from LLMs by leveraging their capacity to explain solutions.
Our experiments demonstrate that learning from explanations enables the Reasoner to more effectively guide program implementation by a Coder.
arXiv Detail & Related papers (2024-04-11T22:19:50Z)
- RCoT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought [56.558892336235914]
Reversing Chain-of-Thought (RCoT) is a novel method to improve large language models' reasoning abilities.
RCoT automatically detects and rectifies factual inconsistency in generated solutions.
We show that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities.
arXiv Detail & Related papers (2023-05-19T08:02:52Z)
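RCoT's core move, reconstructing the problem from a candidate solution and comparing the reconstruction with the original statement, can be sketched as two prompts (the wording and the `llm_complete` placeholder are our assumptions):

```python
# Sketch of Reversing Chain-of-Thought: regenerate the problem from the
# model's solution, then diff the two statements to surface overlooked
# or hallucinated conditions.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def rcot_check(problem: str, solution: str) -> str:
    reconstructed = llm_complete(
        "Given only this step-by-step solution, reconstruct the exact "
        "problem statement it solves:\n" + solution)
    # A fine-grained comparison flags missing or fabricated conditions,
    # which can then be returned as feedback for rectification.
    return llm_complete(
        "Compare the two problem statements and list every condition "
        "present in one but not the other.\n\nOriginal:\n" + problem +
        "\n\nReconstructed:\n" + reconstructed)
```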
- Fully Autonomous Programming with Large Language Models [0.9558392439655015]
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome".
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
arXiv Detail & Related papers (2023-04-20T16:12:05Z)
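The synthesize-then-repair framework can be compressed into a greedy search loop around the base model, loosely in the spirit of the paper's genetic-programming-inspired repair phase; `llm_complete`, the prompts, and the `count_passing` harness are assumptions of ours:

```python
# Sketch of LLM synthesis plus a test-driven repair phase: keep the
# candidate passing the most tests and ask the model to repair it,
# similar in spirit to a (1+1) evolutionary loop.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

def count_passing(code: str, tests: list[tuple[str, str]]) -> int:
    # Count how many (input, expected_output) pairs the program passes;
    # a real harness would execute the code in a sandbox.
    raise NotImplementedError

def synthesize_and_repair(description: str,
                          tests: list[tuple[str, str]],
                          budget: int = 5) -> str:
    best = llm_complete("Write a Python program for:\n" + description)
    best_score = count_passing(best, tests)
    for _ in range(budget):
        if best_score == len(tests):
            break  # all tests pass; stop early
        candidate = llm_complete(
            f"This program fails some tests.\n{best}\n"
            f"Task:\n{description}\nRepair it and output the full program.")
        score = count_passing(candidate, tests)
        if score > best_score:  # greedy selection keeps the fitter program
            best, best_score = candidate, score
    return best
```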