Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?
- URL: http://arxiv.org/abs/2509.00629v1
- Date: Sat, 30 Aug 2025 23:02:12 GMT
- Title: Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?
- Authors: Md Tanzib Hosain, Md Kishor Morol
- Abstract summary: This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. We develop and evaluate a variety of LM inference techniques for competitive programming with these resources. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions.
- Score: 0.34376560669160394
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Among the hardest tasks for humans are those found in competitive programming, where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain for assessing language models (LMs), however, it has not received enough attention. This study presents the ICPC benchmark, which consists of 254 International Collegiate Programming Contest (ICPC) tasks. Each problem includes official analysis, reference code, sample tests, high-quality unit tests, and hidden tests. These resources let us develop and evaluate a variety of LM inference techniques for competitive programming. With zero-shot chain-of-thought prompting, we find that o1 achieves only a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judging with reflection and retrieval over episodic information, raises this to 42.2%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that with just a few targeted instructions, o1 can solve 17 of 18 problems that were previously unsolvable by any model or technique. Our quantitative findings and qualitative research provide a step toward LMs with grounded, imaginative, and algorithmic thinking. We open-source our code and data at https://github.com/kraritt/zolve.
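For orientation, pass@1 is the fraction of problems solved by a single sampled solution per problem. The sketch below is one plausible way to assemble the abstract's best-performing technique, multi-turn self-judging with reflection and retrieval over episodic information. Every function name, prompt, and the recency-based retrieval heuristic here is an assumption for illustration, not the authors' actual code, which lives in the repository linked above.

```python
"""A minimal sketch, under stated assumptions, of a multi-turn self-judge
loop with reflection and retrieval over episodic information. All names,
prompts, and the retrieval heuristic are illustrative stand-ins, not the
authors' implementation (see https://github.com/kraritt/zolve)."""

from dataclasses import dataclass


@dataclass
class Episode:
    """One failed attempt: the code, judge feedback, and a self-critique."""
    code: str
    feedback: str
    reflection: str


def query_lm(prompt: str) -> str:
    """Stub for an LM call (e.g., to o1); wire up a real client here."""
    raise NotImplementedError


def run_sample_tests(code: str) -> tuple[bool, str]:
    """Stub: run `code` against sample/unit tests in a sandboxed judge."""
    raise NotImplementedError


def retrieve(episodes: list[Episode], k: int = 3) -> str:
    """Episodic retrieval, approximated here by recency: surface the k
    most recent attempts and their reflections as context."""
    return "\n\n".join(
        f"Attempt:\n{e.code}\nReflection:\n{e.reflection}" for e in episodes[-k:]
    )


def solve(problem: str, max_turns: int = 5) -> str | None:
    episodes: list[Episode] = []
    for _ in range(max_turns):
        # Condition each new attempt on retrieved prior failures.
        context = retrieve(episodes)
        code = query_lm(
            f"Solve this ICPC problem:\n{problem}\n\nPrior attempts:\n{context}"
        )
        passed, feedback = run_sample_tests(code)
        if passed:
            return code  # accept the first attempt that passes the tests
        # Self-judge with reflection: the model critiques its failed attempt.
        reflection = query_lm(
            f"The solution below failed.\nProblem:\n{problem}\nCode:\n{code}\n"
            f"Judge feedback:\n{feedback}\nExplain the failure and plan a fix."
        )
        episodes.append(Episode(code, feedback, reflection))
    return None  # unsolved within the turn budget
```

The essential design point is that each new attempt is conditioned on retrieved prior failures and their self-generated critiques, so later turns can avoid repeating earlier mistakes rather than resampling from scratch.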
Related papers
- Learning to Discover at Test Time [79.84622971773862]
We use AI to discover a new state of the art for a scientific problem. We call this method Test-Time Training to Discover (TTT-Discover). We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology.
arXiv Detail & Related papers (2026-01-22T18:24:00Z) - SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems, and a large-scale benchmark of 1,000 diverse ODE tasks. We evaluate open- and closed-source LLMs along two axes: (i) unguided versus guided prompting with domain-specific knowledge, and (ii) off-the-shelf versus fine-tuned variants. Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z) - AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming [2.3020018305241337]
We introduce AlgoSimBench, a new benchmark designed to assess the ability to identify algorithmically similar problems (ASPs). AlgoSimBench consists of 1317 problems, annotated with distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs). Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. We propose attempted solution matching (ASM), a novel method for improving problem similarity detection.
arXiv Detail & Related papers (2025-07-21T08:34:20Z) - FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming [19.576944188747166]
FormulaOne is a benchmark for graph theory, logic, and algorithms. Our problems are incredibly demanding, requiring an array of reasoning steps. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne.
arXiv Detail & Related papers (2025-07-17T17:53:55Z) - CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
arXiv Detail & Related papers (2025-01-02T13:49:00Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z) - Probeable Problems for Beginner-level Programming-with-AI Contests [0.0]
We conduct a 2-hour programming contest for undergraduate Computer Science students from multiple institutions.
Students were permitted to work individually or in groups, and were free to use AI tools.
We analyze the extent to which the code submitted by these groups identifies missing details, and we identify ways in which Probeable Problems can support learning in formal and informal CS educational contexts.
arXiv Detail & Related papers (2024-05-24T00:39:32Z) - Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad.
We construct and test a range of LM inference methods for competitive programming for the first time.
We find that GPT-4 achieves only an 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z) - Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs [2.3020018305241337]
Distilling explicit chain-of-thought reasoning paths has emerged as an effective method for improving the reasoning abilities of large language models.
We propose a novel approach to distill reasoning abilities from LLMs by leveraging their capacity to explain solutions.
Our experiments demonstrate that learning from explanations enables the Reasoner to more effectively guide program implementation by a Coder.
arXiv Detail & Related papers (2024-04-11T22:19:50Z) - Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z) - Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z)