ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions
- URL: http://arxiv.org/abs/2503.04241v1
- Date: Thu, 06 Mar 2025 09:22:23 GMT
- Title: ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions
- Authors: Julian Aron Prenner, Romain Robbes
- Abstract summary: Large Language Models (LLMs) have shown astounding capabilities in code understanding and synthesis. We introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs in four different programming languages. We evaluate our benchmark on six state-of-the-art code LLMs and see modest performance ranging from 19% to 38% (F1 score).
- Score: 4.852619858744873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Large Language Models (LLMs) have shown astounding capabilities in code understanding and synthesis. To assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural language instructions and therefore do not test other forms of code understanding. Moreover, there have been concerns about contamination and leakage: benchmark problems (or closely related problems) may appear in the training set, strongly biasing benchmark results. In this work we investigate whether large language models can correctly predict runtime program behavior. To this end, we introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs in four different programming languages. The majority of these programs throw an exception at runtime (due to a bug). LLMs are asked to predict whether a presented program throws an exception and, if so, which one. Evaluating our benchmark on six state-of-the-art code LLMs, we see modest performance ranging from 19% to 38% (F1 score). Benchmarking a wider set of code capabilities could improve the assessment of code LLMs and help identify weak points in current models. Moreover, as ground-truth answers have been determined through program execution, leakage is not a concern. We release ThrowBench as well as all of our results together with this work.
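To make the task concrete, here is a minimal sketch of the kind of item ThrowBench evaluates: a short buggy program whose ground-truth label is obtained by actually running it. The specific program, prompt wording, and label fields below are illustrative assumptions and are not taken from the benchmark itself.

```python
# Illustrative ThrowBench-style item (hypothetical example, not from the dataset).
# A model sees only the source code and must predict (a) whether the program
# throws an exception at runtime and (b) which exception type.

def mean(values):
    # Bug: an empty list makes the denominator zero.
    return sum(values) / len(values)

if __name__ == "__main__":
    scores = []          # the user forgot to populate the list
    print(mean(scores))  # raises ZeroDivisionError when executed

# Ground truth, determined by executing the program rather than by inspection:
#   throws_exception: yes
#   exception_type:   ZeroDivisionError
```

Because the label is produced by execution rather than taken from any published test suite, memorized benchmark answers cannot leak into the ground truth, which is the basis for the paper's claim that leakage is not a concern.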
Related papers
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
- Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages.
We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z)
- LiveBench: A Challenging, Contamination-Free LLM Benchmark [101.21578097087699]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources.
We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size.
Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
- CYCLE: Learning to Self-Refine the Code Generation [19.71833229434497]
We propose the CYCLE framework, which learns to self-refine faulty generations according to the available feedback.
We implement four variants of CYCLE with 350M, 1B, 2B, and 3B parameters.
The results reveal that CYCLE successfully maintains, sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs.
arXiv Detail & Related papers (2024-03-27T16:45:02Z)
- Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL) for solving natural language problems.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.