Related papers: Competitive Programming with Large Reasoning Models

Competitive Programming with Large Reasoning Models

URL: http://arxiv.org/abs/2502.06807v2
Date: Tue, 18 Feb 2025 22:21:40 GMT
Title: Competitive Programming with Large Reasoning Models
Authors: OpenAI, :, Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou,
Abstract summary: We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks.<n>We compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi.<n>Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inferences.
Score: 73.7455809592467
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.

Related papers

OJBench: A Competition Level Code Benchmark For Large Language Models [23.061564017225734]
OJBench is a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of large language models (LLMs)<n>We conduct a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models.<n>Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems.
arXiv Detail & Related papers (2025-06-19T15:27:02Z)
MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities [0.0]
We propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora.<n>Specifically, cross-entropy (CE) loss is applied to domain-corpus to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities.
arXiv Detail & Related papers (2025-05-17T15:12:47Z)
START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-ging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks [7.686622572497795]
Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback. We show that performance on Arena Hard, a benchmark of Arena Elo can be boosted by scaling the number of initial response drafts, effective feedback and edited responses.
arXiv Detail & Related papers (2025-03-06T12:30:24Z)
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments.<n>CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
arXiv Detail & Related papers (2025-01-02T13:49:00Z)
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets. We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
Double Oracle Neural Architecture Search for Game Theoretic Deep Learning Models [28.238075755838487]
We propose a new approach to train deep learning models using game theory concepts. We deploy a double-versarial framework using best response oracles. We show that all our variants have significant improvements in both subjective qualitative evaluation and quantitative metrics.
arXiv Detail & Related papers (2024-10-07T05:42:01Z)
Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales such architectures to autoregressive language model pre-training. We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks. Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
arXiv Detail & Related papers (2024-05-06T03:06:33Z)
Double A3C: Deep Reinforcement Learning on OpenAI Gym Games [0.0]
Reinforcement Learning (RL) is an area of machine learning figuring out how agents take actions in an unknown environment to maximize its rewards. We will propose and implement an improved version of Double A3C algorithm which utilizing the strength of both algorithms to play OpenAI Gym Atari 2600 games to beat its benchmarks.
arXiv Detail & Related papers (2023-03-04T00:06:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.