PECC: Problem Extraction and Coding Challenges
- URL: http://arxiv.org/abs/2404.18766v1
- Date: Mon, 29 Apr 2024 15:02:14 GMT
- Title: PECC: Problem Extraction and Coding Challenges
- Authors: Patrick Haller, Jonas Golde, Alan Akbik
- Abstract summary: We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler.
Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code.
Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
- Score: 3.287942619833188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving and reasoning. Existing benchmarks evaluate tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler, comprising 2,396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
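A minimal sketch of the kind of execution-based check the abstract implies (a hypothetical harness, not the paper's released evaluation code; `extract_code` and `passes` are assumed helpers): pull a Python block out of the model's chat reply, run it on the puzzle input, and compare its printed answer to the expected solution.

```python
import re
import subprocess

FENCE = "`" * 3  # triple-backtick delimiter used for code blocks in chat replies

def extract_code(chat_reply: str) -> str:
    """Pull the first fenced Python block out of a chat-style model reply."""
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, chat_reply, re.DOTALL)
    if match is None:
        raise ValueError("no code block found in the model reply")
    return match.group(1)

def passes(chat_reply: str, puzzle_input: str, expected: str, timeout: int = 30) -> bool:
    """Run the extracted program on the puzzle input and check its printed answer."""
    program = extract_code(chat_reply)
    result = subprocess.run(
        ["python", "-c", program],
        input=puzzle_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0 and result.stdout.strip() == expected.strip()
```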
Related papers
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
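A hedged sketch of such a self-critique loop (an interpretation, not the paper's implementation; `ask_llm` and the prompt wording are assumptions): run the candidate, feed the error back to the model, and repeat.

```python
import subprocess

def run_candidate(code: str, timeout: int = 10) -> str:
    """Execute candidate code; return an empty string on success, else the error text."""
    try:
        result = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "execution timed out"
    return "" if result.returncode == 0 else result.stderr

def self_critique_repair(task: str, code: str, ask_llm, max_rounds: int = 3) -> str:
    """Iteratively ask the model to critique and correct its own code using execution feedback.

    ask_llm(prompt) -> str is an assumed chat-completion wrapper, not part of the paper.
    """
    for _ in range(max_rounds):
        error = run_candidate(code)
        if not error:
            return code  # candidate runs cleanly; stop iterating
        code = ask_llm(
            f"Task: {task}\n\nYour code:\n{code}\n\nIt failed with:\n{error}\n\n"
            "Explain the bug, then output only the corrected program."
        )
    return code
```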
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns? [57.80779199039929]
Large Language Models (LLMs) have demonstrated remarkable performance in solving math problems.
This paper introduces a novel benchmark, BeyondX, designed to address these limitations by incorporating problems with multiple unknowns.
Empirical study on BeyondX reveals that the performance of existing LLMs, even those fine-tuned specifically on math tasks, significantly decreases as the number of unknowns increases.
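For intuition, a toy problem with three unknowns of the kind BeyondX targets (illustrative only, not drawn from the benchmark), solved symbolically:

```python
from sympy import Eq, solve, symbols

# Toy word problem (not from BeyondX): Alice, Bob and Carol have 33 coins in total;
# Alice has twice as many as Bob, and Carol has 3 more than Alice.
a, b, c = symbols("a b c")
solution = solve(
    [Eq(a + b + c, 33), Eq(a, 2 * b), Eq(c, a + 3)],
    [a, b, c],
)
print(solution)  # {a: 12, b: 6, c: 15}
```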
arXiv Detail & Related papers (2024-07-06T17:01:04Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- Knowledge-Aware Code Generation with Large Language Models [34.806454393643236]
Large Language Models (LLMs) perform well on basic programming problems.
However, they encounter challenges when dealing with complex tasks involving the use of diverse algorithmic and data structure skills.
We develop a Knowledge Library tailored for Python programming contest problems and introduce the concept of Knowledge-Aware Code Generation.
arXiv Detail & Related papers (2024-01-29T08:01:22Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- ACES: Generating Diverse Programming Puzzles with Autotelic Generative Models [20.039580079339537]
Autotelic CodE Search (ACES) jointly optimizes for the diversity and difficulty of generated problems.
We represent problems in a space of semantic descriptors describing the programming skills required to solve them.
ACES iteratively prompts a large language model to generate difficult problems achieving a diversity of target semantic descriptors.
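A rough sketch of this kind of descriptor-driven generation loop (an interpretation, not the authors' code; `ask_llm` and `label_skills` are assumed helpers): keep an archive keyed by skill descriptors and repeatedly prompt for puzzles that cover new descriptor combinations.

```python
import random

SKILLS = ["recursion", "dynamic programming", "graph search", "string parsing", "bit manipulation"]

def generate_puzzles(ask_llm, label_skills, rounds: int = 20) -> dict:
    """Iteratively grow an archive of puzzles keyed by the skill combination they require.

    ask_llm(prompt) -> str and label_skills(puzzle) -> tuple[str, ...] are assumed
    helpers (an LLM call and a semantic-descriptor labeller).
    """
    archive: dict[tuple, str] = {}
    for _ in range(rounds):
        # Target a random skill combination, skipping ones already covered.
        target = tuple(sorted(random.sample(SKILLS, k=2)))
        if target in archive:
            continue
        puzzle = ask_llm(
            f"Write a difficult programming puzzle that requires: {', '.join(target)}."
        )
        archive[label_skills(puzzle)] = puzzle  # file it under its actual descriptors
    return archive
```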
arXiv Detail & Related papers (2023-10-15T14:57:14Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), in which an LLM reads a natural language problem and generates a program as its intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
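A minimal sketch of the PAL idea (hypothetical `ask_llm` wrapper, not the authors' released code): the model writes Python as its reasoning, and the interpreter, not the model, computes the final answer.

```python
def pal_answer(question: str, ask_llm) -> object:
    """Have the model write Python reasoning, then let the interpreter compute the answer.

    ask_llm(prompt) -> str is an assumed chat-completion wrapper that returns plain code.
    """
    prompt = (
        "Write Python code that solves the problem step by step and stores "
        f"the final result in a variable named `answer`.\n\nProblem: {question}"
    )
    code = ask_llm(prompt)
    namespace: dict = {}
    exec(code, namespace)  # the runtime does the arithmetic, not the LLM
    return namespace["answer"]

# Example of the kind of code the model is expected to return:
# trays = 3; muffins_per_tray = 12; eaten = 5
# answer = trays * muffins_per_tray - eaten
```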
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.