Related papers: BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

URL: http://arxiv.org/abs/2503.15242v2
Date: Thu, 20 Mar 2025 17:58:17 GMT
Title: BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?
Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve,
Abstract summary: BigO(Bench) is a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities.<n>BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements.<n>We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements.
Score: 20.550427148810556
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes of set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.

Related papers

Program Semantic Inequivalence Game with Large Language Models [10.358176296850639]
Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics.<n>In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ.<n>We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources.
arXiv Detail & Related papers (2025-05-02T20:03:35Z)
PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving [50.50405233978406]
We propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG)<n>OVPG aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks.<n>Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples.
arXiv Detail & Related papers (2025-04-15T05:29:31Z)
EpiCoder: Encompassing Diversity and Complexity in Code Generation [49.170195362149386]
We introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST) Unlike AST, which captures syntactic structure of code, our framework models semantic relationships between code elements. We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels.
arXiv Detail & Related papers (2025-01-08T18:58:15Z)
Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process. This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving [3.3856216159724983]
We introduce a new approach to code generation tasks leveraging multi-agent prompting. Our framework, MapCoder, consists of four LLM agents specifically designed to emulate the stages of program synthesis. Our method consistently delivers superior performance across various programming languages.
arXiv Detail & Related papers (2024-05-18T22:10:15Z)
CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing [51.00909683314142]
Large Language Models have revolutionized code generation ability by converting natural language descriptions into executable code. CoCoST framework enhances complex code generation by online searching for more information with planned queries and correctness testing for code refinement. CoCoST is validated through rigorous experiments on the DS-1000 and ClassEval datasets.
arXiv Detail & Related papers (2024-03-20T13:33:55Z)
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects. We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
CodeComplex: Dataset for Worst-Case Time Complexity Prediction [7.974618854858136]
Code time complexity prediction involves various intricate factors such as the input range of variables and conditional loops.<n>Current benchmarks fall short of providing a rigorous assessment due to limited data, language constraints, and insufficient labeling.<n>We introduce CodeComplex, the first robust and extensive dataset designed to evaluate LLMs' reasoning abilities in predicting code time complexity.
arXiv Detail & Related papers (2024-01-16T06:54:44Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
TASTY: A Transformer based Approach to Space and Time complexity [0.4724825031148411]
Code based Language Models (LMs) have shown very promising results in the field of software engineering. We create a labelled dataset of code snippets spanning multiple languages. We propose to use LMs to find space complexities from code, and to the best of our knowledge, this is the first attempt to do so.
arXiv Detail & Related papers (2023-05-06T03:37:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.