Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
- URL: http://arxiv.org/abs/2510.14232v1
- Date: Thu, 16 Oct 2025 02:19:25 GMT
- Title: Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
- Authors: Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
- Abstract summary: GenCluster is a test-time compute framework that attains IOI gold-level performance using open-weight models. It achieves a gold medal at IOI 2025 for the first time with an open-weight model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we show that GenCluster achieves a gold medal at IOI 2025 for the first time with the open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
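The pipeline named in the abstract lends itself to a compact outline. Below is a minimal sketch of a GenCluster-style loop under the simplest reading of each stage: candidates are clustered by their behavior on shared probe inputs, clusters are ranked by size, and representatives are submitted round-robin until the validation budget runs out. The function names and the size-based ranking heuristic are illustrative assumptions, not the paper's exact method.

```python
# Sketch of a generate-cluster-rank-submit loop in the GenCluster style.
# `run` executes a program on an input; `judge` returns the official verdict.
from collections import defaultdict

def behavioral_signature(program, probe_inputs, run):
    """Fingerprint a candidate by its outputs on shared probe inputs."""
    return tuple(run(program, x) for x in probe_inputs)

def gencluster_submit(candidates, probe_inputs, run, judge, budget):
    # 1. Cluster: programs with identical probe behavior count as one group.
    clusters = defaultdict(list)
    for prog in candidates:
        clusters[behavioral_signature(prog, probe_inputs, run)].append(prog)

    # 2. Rank: larger clusters first (a majority-vote style heuristic).
    ranked = sorted(clusters.values(), key=len, reverse=True)

    # 3. Round-robin: submit one representative per cluster, cycling through
    #    the clusters until the validation budget is exhausted.
    round_idx = 0
    while budget > 0:
        submitted = False
        for cluster in ranked:
            if round_idx < len(cluster) and budget > 0:
                budget -= 1
                submitted = True
                if judge(cluster[round_idx]):
                    return cluster[round_idx]  # accepted solution
        if not submitted:
            break  # every cluster exhausted
        round_idx += 1
    return None  # no accepted solution within budget
```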
Related papers
- DAJ: Data-Reweighted LLM Judge for Test-Time Scaling in Code Generation [30.131052926559956]
We propose DAJ, a reasoning-based LLM judge trained with rewards under a bi-level data-reweighted learning framework. Our approach automatically emphasizes hard problems, in-distribution samples, and trajectory-aligned data, without relying on hand-crafted verifiables.
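As a rough illustration of data reweighting in this spirit, the sketch below weights training examples by an assumed difficulty proxy (one minus the solve rate) and scores the judge's training objective with those weights; the proxy and the softmax weighting are illustrative stand-ins, not DAJ's bi-level procedure.

```python
import math

def difficulty_weights(solve_rates, temperature=1.0):
    """Lower solve rate = harder problem = larger training weight (assumed proxy)."""
    logits = [(1.0 - r) / temperature for r in solve_rates]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]

def weighted_judge_loss(per_example_losses, weights):
    # Train the judge on the reweighted objective.
    return sum(w * l for w, l in zip(weights, per_example_losses))
```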
arXiv Detail & Related papers (2026-01-29T19:04:24Z)
- ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [84.07076200941474]
ArenaRL is a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. We construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Experiments show that ArenaRL substantially outperforms standard RL baselines.
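One plausible reading of tournament-based relative ranking is sketched below: responses in a group play pairwise matches, win totals induce a ranking, and standardized win totals serve as advantage signals. The pairwise judge interface and the standardization step are assumptions, not ArenaRL's exact scheme.

```python
import itertools
import statistics

def tournament_advantages(responses, pairwise_judge):
    """pairwise_judge(a, b) -> 1.0 if a beats b, 0.0 if b wins, 0.5 for a tie."""
    wins = [0.0] * len(responses)
    for i, j in itertools.combinations(range(len(responses)), 2):
        outcome = pairwise_judge(responses[i], responses[j])
        wins[i] += outcome
        wins[j] += 1.0 - outcome
    # Standardize win totals within the group: zero-mean advantage signals.
    mean = statistics.mean(wins)
    std = statistics.pstdev(wins) or 1.0
    return [(w - mean) / std for w in wins]
```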
arXiv Detail & Related papers (2026-01-10T08:43:07Z)
- LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics [23.99262273166077]
The proliferation of Large Language Models (LLMs) and diverse specialized benchmarks requires a shift from fragmented, task-specific metrics to a holistic, competitive ranking system. We introduce the novel Competitive Swiss-System Dynamics (CSD) framework, which simulates a sequential contest. CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models.
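For intuition, here is a minimal sketch of a Swiss-system contest over benchmarks: each round pairs models with similar running scores and awards points from head-to-head benchmark comparisons. The pairing and scoring rules are standard Swiss conventions, assumed here rather than taken from the CSD paper.

```python
def swiss_standings(models, round_scores, n_rounds):
    """round_scores[m] is a per-round list of that model's benchmark scores."""
    points = {m: 0.0 for m in models}
    for rnd in range(n_rounds):
        # Pair adjacent models in the current standings (1st vs 2nd, ...).
        # Real Swiss systems also avoid rematches; omitted here for brevity.
        order = sorted(models, key=lambda m: points[m], reverse=True)
        for a, b in zip(order[::2], order[1::2]):
            sa, sb = round_scores[a][rnd], round_scores[b][rnd]
            if sa > sb:
                points[a] += 1.0
            elif sb > sa:
                points[b] += 1.0
            else:
                points[a] += 0.5
                points[b] += 0.5
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)
```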
arXiv Detail & Related papers (2025-12-24T07:14:31Z)
- OJBench: A Competition Level Code Benchmark For Large Language Models [23.061564017225734]
OJBench is a novel and challenging benchmark designed to assess the competition-level code reasoning abilities of large language models (LLMs). We conduct a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems.
arXiv Detail & Related papers (2025-06-19T15:27:02Z)
- LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) are reported to outperform elite humans in competitive programming. We revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
- OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics [13.049841309304922]
This paper introduces OIBench, a high-quality, private, and challenging olympiad-level informatics dataset comprising 250 carefully curated original problems. We detail the construction methodology of the benchmark, ensuring a comprehensive assessment across various programming paradigms and complexities. We propose Time/Space Completion Curves for finer-grained efficiency analysis and enable direct human-model comparisons.
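The name "Time/Space Completion Curves" suggests a curve of the following kind, sketched below as an assumption based solely on the name: for each resource-budget multiplier, the fraction of problems a model solves within that multiple of the official limit. OIBench's exact definition may differ.

```python
def completion_curve(usage_ratios, multipliers):
    """usage_ratios: per-problem (resource used / official limit), None if unsolved."""
    solved = [r for r in usage_ratios if r is not None]
    return [(m, sum(1 for r in solved if r <= m) / len(usage_ratios))
            for m in multipliers]

# Example: 4 problems, one unsolved; curve at 0.5x, 1x, and 2x the time limit.
print(completion_curve([0.4, 0.9, 1.6, None], [0.5, 1.0, 2.0]))
# -> [(0.5, 0.25), (1.0, 0.5), (2.0, 0.75)]
```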
arXiv Detail & Related papers (2025-06-12T08:33:38Z)
- Competitive Programming with Large Reasoning Models [73.7455809592467]
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. We compare two general-purpose reasoning models, OpenAI o1 and an early checkpoint of o3, with a domain-specific system, o1-ioi. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics.
arXiv Detail & Related papers (2025-02-03T23:00:15Z)
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
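For reference, the standard Elo update that any Elo-style code benchmark builds on is shown below; whether CodeElo applies exactly this per-problem update is an assumption, and its actual rating procedure may differ.

```python
def elo_update(rating, opponent_rating, score, k=32.0):
    """score: 1.0 for a win (problem solved), 0.0 for a loss, 0.5 for a draw."""
    expected = 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))
    return rating + k * (score - expected)

# Example: a 1500-rated model solves a problem rated 1600.
print(round(elo_update(1500, 1600, 1.0), 1))  # ~1520.5
```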
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path. Tested on general and advanced benchmarks, it shows superior performance in search efficiency and problem-solving capability.
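One plausible reading of the "pairwise optimization" in the title is aggregating pairwise preferences between candidate solutions into a global ranking, sketched below with a Borda-style count; the paper's actual aggregation may differ.

```python
import itertools

def borda_rank(solutions, prefer):
    """prefer(a, b) -> True if a is judged better than b (assumed interface)."""
    score = {i: 0 for i in range(len(solutions))}
    for i, j in itertools.combinations(range(len(solutions)), 2):
        winner = i if prefer(solutions[i], solutions[j]) else j
        score[winner] += 1  # one Borda point per pairwise win
    order = sorted(score, key=score.get, reverse=True)
    return [solutions[i] for i in order]
```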
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
- Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control [1.1404490220482764]
BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms. It is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
arXiv Detail & Related papers (2024-05-25T09:53:25Z)