Related papers: FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

URL: http://arxiv.org/abs/2507.13337v1
Date: Thu, 17 Jul 2025 17:53:55 GMT
Title: FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming
Authors: Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua
Abstract summary: FormulaOne is a benchmark for graph theory, logic, and algorithms. Our problems are incredibly demanding, requiring an array of reasoning steps. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne.
Score: 19.576944188747166
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale; ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory fewshot examples -- highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks, from the same distribution. We release the full corpus along with a comprehensive evaluation framework.

Related papers

Vibe Reasoning: Eliciting Frontier AI Mathematical Capabilities -- A Case Study on IMO 2025 Problem 6 [28.84243696489176]
We introduce Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. We demonstrate this paradigm through IMO 2025 Problem 6, an optimization problem where autonomous AI systems publicly reported failures.
arXiv Detail & Related papers (2025-12-22T11:30:19Z)
FrontierCS: Evolving Challenges for Evolving Intelligence [174.80075821079708]
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science. For each problem we provide an expert reference solution and an automatic evaluator. We find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks.
arXiv Detail & Related papers (2025-12-17T18:52:45Z)
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving [37.708900742664184]
We introduce EngiBench, a hierarchical benchmark to evaluate large language models (LLMs) on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. Experiment results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks.
arXiv Detail & Related papers (2025-09-22T12:20:27Z)
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics [13.530403536762064]
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen.
arXiv Detail & Related papers (2025-02-19T19:00:00Z)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
Supervised Chain of Thought [5.389461633686935]
Chain of Thought (CoT) prompting offers a promising approach to solving complex reasoning tasks. The one-prompt-for-all approach poses significant challenges for models to generate the correct reasoning steps. We show how task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance.
arXiv Detail & Related papers (2024-10-18T06:25:27Z)
Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation. We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets. We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints. Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal. We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
Towards a Holistic Understanding of Mathematical Questions with Contrastive Pre-training [65.10741459705739]
We propose a novel contrastive pre-training approach for mathematical question representations, namely QuesCo. We first design two-level question augmentations, including content-level and structure-level, which generate literally diverse question pairs with similar purposes. Then, to fully exploit hierarchical information of knowledge concepts, we propose a knowledge hierarchy-aware rank strategy.
arXiv Detail & Related papers (2023-01-18T14:23:29Z)
On Theoretical Complexity and Boolean Satisfiability [0.0]
This thesis introduces some of the most central concepts in the Theory of Computing. We then explore some tractable as well as intractable variants of Boolean satisfiability (SAT), such as Horn-SAT and 3-SAT. Finally, we establish reductions from 3-SAT to some of the famous NP-complete graph problems.
arXiv Detail & Related papers (2021-12-22T10:13:34Z)
How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI [32.54495599722743]
We propose a new reasoning challenge, namely Fermi Problems (FPs). FPs are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. We present two datasets: 1) a collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge.
arXiv Detail & Related papers (2021-10-27T06:39:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.