ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
- URL: http://arxiv.org/abs/2506.09050v1
- Date: Tue, 10 Jun 2025 17:59:56 GMT
- Title: ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
- Authors: Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba
- Abstract summary: We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations.
- Score: 1.6932802756478724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.
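The framework described in the abstract exposes test-run feedback to interactive agents over long horizons. As a rough illustration of that loop, here is a minimal sketch in Python; the `Session`/`RunResult` API and the `propose` callable are hypothetical stand-ins for exposition, not ALE-Bench's actual interface.

```python
# Illustrative long-horizon refinement loop in the spirit of ALE-Bench.
# The Session/RunResult/propose interface below is hypothetical, not the
# real framework's API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class RunResult:
    score: float   # contest-style score on local test cases
    feedback: str  # stderr/visualizer output exposed to the agent

class Session:
    """Stand-in for a benchmark session wrapping one optimization problem."""
    def run_tests(self, code: str) -> RunResult:
        raise NotImplementedError  # provided by the (hypothetical) harness

def refine(session: Session,
           propose: Callable[[str, str], str],
           initial_code: str,
           budget: int) -> str:
    """Iteratively improve a heuristic solution using test-run feedback.

    `propose(code, feedback)` is any code-rewriting policy, e.g. an LLM call
    that rewrites the solution given the score and visualizer output.
    """
    best_code, best_score = initial_code, float("-inf")
    code = initial_code
    for _ in range(budget):
        result = session.run_tests(code)
        if result.score > best_score:
            best_code, best_score = code, result.score
        # Feed the run feedback back into the next proposal.
        code = propose(code, result.feedback)
    return best_code
```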
Related papers
- LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? [88.29001498765629]
Large language models (LLMs) now outperform elite humans in competitive programming. We revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions.
arXiv Detail & Related papers (2025-06-13T16:29:09Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
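External TTC is commonly realized as sampling several candidate solutions and selecting among them. The sketch below shows a minimal best-of-n selector; `generate` and `score` are assumed placeholder callables, not the paper's actual components.

```python
# Minimal external test-time-compute sketch: sample n candidates, keep the
# highest-scoring one. `generate` and `score` are placeholder callables.

from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```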
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Efficiently Scaling LLM Reasoning with Certaindex [25.549811985276488]
Test-time reasoning algorithms can wastefully generate many tokens without improving accuracy. We introduce Certaindex, an algorithm-agnostic metric measuring when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and enables dynamic token allocation.
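One plausible instantiation of such an early-exit rule (not necessarily the paper's Certaindex definition) is to stop sampling reasoning paths once the leading answer's vote share stabilizes:

```python
# Agreement-based early exit: stop sampling once the leading answer's vote
# share exceeds a threshold. This is one plausible certainty proxy, offered
# only as an illustration of the early-exit idea.

from collections import Counter
from typing import Callable

def sample_with_early_exit(sample_answer: Callable[[], str],
                           max_samples: int = 16,
                           threshold: float = 0.8) -> str:
    votes: Counter = Counter()
    for i in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        answer, count = votes.most_common(1)[0]
        if i >= 4 and count / i >= threshold:
            return answer  # certainty high enough: exit early
    return votes.most_common(1)[0][0]
```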
arXiv Detail & Related papers (2024-12-30T14:57:53Z)
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
It has been tested on general and advanced benchmarks, showing superior search efficiency and problem-solving capability.
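A heavily simplified sketch of pairwise-preference-driven refinement follows; it omits the MCTS component entirely, and `judge` and `self_refine` are assumed placeholders rather than the paper's modules.

```python
# Simplified pairwise-optimization loop: rank candidates with a pairwise
# judge and iteratively self-refine the winner. Placeholder callables only.

from typing import Callable, List

def pairwise_best(candidates: List[str],
                  judge: Callable[[str, str], str]) -> str:
    """Single-elimination pass using a pairwise preference judge."""
    best = candidates[0]
    for challenger in candidates[1:]:
        best = judge(best, challenger)  # returns the preferred of the two
    return best

def refine_loop(seed: str,
                self_refine: Callable[[str], str],
                judge: Callable[[str, str], str],
                rounds: int = 4) -> str:
    pool = [seed]
    for _ in range(rounds):
        pool.append(self_refine(pairwise_best(pool, judge)))
    return pairwise_best(pool, judge)
```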
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
- Quantum Algorithm Exploration using Application-Oriented Performance Benchmarks [0.0]
The QED-C suite of Application-Oriented Benchmarks provides the ability to gauge performance characteristics of quantum computers.
We investigate challenges in broadening the relevance of this benchmarking methodology to applications of greater complexity.
arXiv Detail & Related papers (2024-02-14T06:55:50Z)
- SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving [64.38649623473626]
Large Language Models (LLMs) have driven substantial progress in artificial intelligence.
We propose a novel framework called SEquential subGoal Optimization (SEGO) to enhance LLMs' ability to solve mathematical problems.
arXiv Detail & Related papers (2023-10-19T17:56:40Z)
- Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning [16.495754104540605]
Large language models (LLMs) can generate code-like plans for complex inference tasks such as visual reasoning.
We propose a hierarchical plan-searching algorithm that integrates one-stop reasoning (fast) and Tree-of-Thought reasoning (slow).
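A minimal sketch of this fast-then-slow control flow, assuming placeholder `fast_plan`, `expand`, and `verify` callables:

```python
# Fast-then-slow control sketch: try a one-shot plan first and fall back to
# a shallow tree search only when verification fails. All callables are
# placeholders, not the paper's actual components.

from typing import Callable, Iterable, List, Optional

def solve_fast_slow(question: str,
                    fast_plan: Callable[[str], str],
                    expand: Callable[[str], Iterable[str]],
                    verify: Callable[[str], bool],
                    depth: int = 3) -> Optional[str]:
    plan = fast_plan(question)  # cheap one-stop reasoning
    if verify(plan):
        return plan
    frontier: List[str] = [question]  # slow path: shallow tree search
    for _ in range(depth):
        frontier = [p for node in frontier for p in expand(node)]
        for candidate in frontier:
            if verify(candidate):
                return candidate
    return None
```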
arXiv Detail & Related papers (2023-08-18T16:21:40Z)
- Qubit efficient quantum algorithms for the vehicle routing problem on NISQ processors [48.68474702382697]
The vehicle routing problem with time windows (VRPTW) is a common optimization problem in the logistics industry.
In this work, we explore the use of a previously-introduced qubit encoding scheme to reduce the number of binary variables.
arXiv Detail & Related papers (2023-06-14T13:44:35Z)
- Robust expected improvement for Bayesian optimization [1.8130068086063336]
We propose a surrogate modeling and active learning technique called robust expected improvement (REI) that ports adversarial methodology into the BO/GP framework.
We illustrate and draw comparisons to several competitors on benchmark synthetic exercises and real problems of varying complexity.
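For context, the classical expected-improvement acquisition that REI builds on is sketched below; this is the standard EI for maximization over a Gaussian-process posterior, and the robust (adversarial) modification is not shown.

```python
# Standard expected improvement over a GP posterior (maximization). REI
# replaces the plain improvement target with an adversarially robust one.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu: np.ndarray,
                         sigma: np.ndarray,
                         f_best: float,
                         xi: float = 0.01) -> np.ndarray:
    """EI(x) = (mu - f_best - xi) * Phi(z) + sigma * phi(z)."""
    sigma = np.maximum(sigma, 1e-12)  # avoid division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```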
arXiv Detail & Related papers (2023-02-16T22:34:28Z)
- An Efficient Merge Search Matheuristic for Maximising the Net Present Value of Project Schedules [5.10800491975164]
Resource-constrained project scheduling is an important optimisation problem with many practical applications.
We propose a new matheuristic algorithm based on Merge Search and parallel computing to solve the resource-constrained project scheduling problem.
arXiv Detail & Related papers (2022-10-20T13:30:23Z)
- Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization [82.02008764719896]
Black-box model-based optimization (MBO) problems are ubiquitous in a wide range of domains, such as the design of proteins, DNA sequences, aircraft, and robots.
We present Design-Bench, a benchmark for offline MBO with a unified evaluation protocol and reference implementations of recent methods.
Our benchmark includes a suite of diverse and realistic tasks derived from real-world optimization problems in biology, materials science, and robotics.
arXiv Detail & Related papers (2022-02-17T05:33:27Z)