AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming
- URL: http://arxiv.org/abs/2507.15378v1
- Date: Mon, 21 Jul 2025 08:34:20 GMT
- Title: AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming
- Authors: Jierui Li, Raymond Mooney,
- Abstract summary: We introduce AlgoSimBench, a new benchmark designed to assess ability to identify algorithmically similar problems (ASPs)<n>AlgoSimBench consists of 1317 problems, annotated with distinct fine-grained algorithm tags, from which we distract 402 multiple-choice questions (MCQs)<n>Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task.<n>We propose attempted solution matching (ASM), a novel method for improving problem similarity detection.
- Score: 2.3020018305241337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent progress in LLMs, such as reasoning models, has demonstrated strong abilities to solve complex competitive programming problems, often rivaling top human competitors. However, it remains underexplored whether these abilities generalize to relevant domains that are less seen during training. To address this, we introduce AlgoSimBench, a new benchmark designed to assess LLMs' ability to identify algorithmically similar problems (ASPs)-problems that can be solved using similar algorithmic approaches. AlgoSimBench consists of 1317 problems, annotated with 231 distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs), where each question presents one algorithmically similar problem alongside three textually similar but algorithmically dissimilar distractors. Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. To address this challenge, we propose attempted solution matching (ASM), a novel method for improving problem similarity detection. On our MCQ task, ASM yields an absolute accuracy improvement of 6.7% to 11.7% across different models. We also evaluated code embedding models and retrieval methods on similar problem identification. While the adversarial selection of problems degrades the performance to be less than random, we found that simply summarizing the problem to remove narrative elements eliminates the effect, and combining ASM with a keyword-prioritized method, BM25, can yield up to 52.2% accuracy. Code and data are available at github.com
Related papers
- Evaluating and Improving Large Language Models for Competitive Program Generation [18.564450345359468]
This study aims to evaluate and improve large language models (LLMs) in solving real-world competitive programming problems.<n>We collect 117 problems from nine regional ICPC/CCPC contests held in 2024 and design four filtering criteria to construct a curated benchmark consisting of 80 problems.<n>We evaluate its competitive program generation capabilities through the online judge (OJ) platforms, guided by a carefully designed basic prompt.
arXiv Detail & Related papers (2025-06-28T17:18:23Z) - Strategic Scaling of Test-Time Compute: A Bandit Learning Approach [13.735277588793995]
Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models.<n>We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly.<n>Our algorithms achieve up to an 11.10% performance improvement on the MATH-500 dataset and up to a 7.41% performance improvement on LiveCodeBench.
arXiv Detail & Related papers (2025-06-15T04:55:49Z) - Diverse Inference and Verification for Advanced Reasoning [19.88677753421871]
Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding.<n>We use a diverse inference approach that combines multiple models and methods at test time.<n>We find that verifying mathematics and code problems, and rejection sampling on other problems is simple and effective.
arXiv Detail & Related papers (2025-02-14T07:22:25Z) - Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z) - Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad.
We construct and test a range of LM inference methods for competitive programming for the first time.
We find GPT-4 only achieves a 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z) - Large Language Model-Enhanced Algorithm Selection: Towards Comprehensive Algorithm Representation [27.378185644892984]
This paper introduces Large Language Models (LLMs) into algorithm selection for the first time.
LLMs not only captures the structural and semantic aspects of the algorithm, but also demonstrates contextual awareness and library function understanding.
The selected algorithm is determined by the matching degree between a given problem and different algorithms.
arXiv Detail & Related papers (2023-11-22T06:23:18Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [70.45113777449373]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.<n>Key challenge in comparing existing algorithms and developing better algorithms is the lack of publicly available gold-standard data.<n>We collect a novel dataset of similarity scores that we release to the research community.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - Algorithm Selection on a Meta Level [58.720142291102135]
We introduce the problem of meta algorithm selection, which essentially asks for the best way to combine a given set of algorithm selectors.
We present a general methodological framework for meta algorithm selection as well as several concrete learning methods as instantiations of this framework.
arXiv Detail & Related papers (2021-07-20T11:23:21Z) - Towards Meta-Algorithm Selection [78.13985819417974]
Instance-specific algorithm selection (AS) deals with the automatic selection of an algorithm from a fixed set of candidates.
We show that meta-algorithm-selection can indeed prove beneficial in some cases.
arXiv Detail & Related papers (2020-11-17T17:27:33Z) - Extreme Algorithm Selection With Dyadic Feature Representation [78.13985819417974]
We propose the setting of extreme algorithm selection (XAS) where we consider fixed sets of thousands of candidate algorithms.
We assess the applicability of state-of-the-art AS techniques to the XAS setting and propose approaches leveraging a dyadic feature representation.
arXiv Detail & Related papers (2020-01-29T09:40:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.