Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench
- URL: http://arxiv.org/abs/2406.12902v1
- Date: Mon, 10 Jun 2024 06:43:25 GMT
- Title: Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench
- Authors: Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu
- Abstract summary: We propose JavaBench, a project-level Java benchmark that exercises OOP features.
It comprises four Java projects with 389 methods in 106 Java classes.
It is attested by 282 undergraduate students, reaching a 90.93/100 average score.
- Score: 22.95865189208591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java. Second, imbalanced code granularity. Function-/statement-level benchmarks account for over 83.3% of benchmarks. Only a mere handful extends to class-/project-levels, and all are limited to Python. Third, lacking advanced features. Existing benchmarks primarily assess basic coding skills, while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLM's capability against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiment yields several interesting findings. First, we noticed that regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLMs, and at most 41.17% Pass@5 in a more relaxed evaluation). Second, using method signature as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench.
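For reference, the Pass@5 figure quoted above presumably follows the standard unbiased pass@k estimator popularized by HumanEval; JavaBench may apply it at method or class granularity, so take this as the generic definition rather than the paper's exact formula:

```latex
% Standard pass@k estimator (Chen et al., 2021): for each task, draw n samples,
% count the c samples that pass the test suite, then average over tasks.
\[
  \text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]
```

The "method signature as prompt context" setting can be pictured with a minimal, invented skeleton; the class and method names below are illustrative and not taken from JavaBench. The model sees the Javadoc, the class structure, and the signatures, and must synthesize the missing method bodies, which are then run against the project's test suite.

```java
// Illustrative sketch only: names are invented, not from JavaBench. It shows the
// kind of project-level skeleton the paper describes -- Javadoc, a class skeleton
// with method signatures as context, and OOP features (encapsulation,
// inheritance, polymorphism) -- where an LLM fills in the method bodies.

/** Base type: demonstrates encapsulation via a private field and a getter. */
abstract class LibraryItem {
    private final String title;          // encapsulated state

    protected LibraryItem(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    /** Polymorphic behavior: each item type computes its own late fee. */
    public abstract double lateFee(int daysOverdue);
}

/** Subclass: demonstrates inheritance; the body below is what an LLM would be
 *  asked to synthesize when only the signature and skeleton are given. */
class Book extends LibraryItem {
    public Book(String title) {
        super(title);
    }

    @Override
    public double lateFee(int daysOverdue) {
        // In the "method signature" context setting, only this signature and the
        // surrounding skeleton are shown; the generated body is then checked
        // against the project's test suite.
        return 0.25 * Math.max(0, daysOverdue);
    }
}

/** Tiny demo of the skeleton above; prints the polymorphic late fee. */
class SkeletonDemo {
    public static void main(String[] args) {
        LibraryItem item = new Book("Effective Java");   // polymorphic reference
        System.out.println(item.getTitle() + " fee: " + item.lateFee(4));
    }
}
```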
Related papers
- EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code [37.712780804235045]
EffiBench-X is the first multi-language benchmark designed to measure the efficiency of LLM-generated code.
It supports Python, C++, Java, JavaScript, Ruby, and Golang.
It comprises competitive programming tasks with human-expert solutions as efficiency baselines.
arXiv Detail & Related papers (2025-05-19T11:43:37Z)
- ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions [4.852619858744873]
Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis.
We introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs written in four different programming languages.
We evaluate our benchmark on six state-of-the-art code LLMs and observe modest performance, with F1 scores ranging from 19% to 38%.
arXiv Detail & Related papers (2025-03-06T09:22:23Z)
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models.
We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories.
Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- Escalating LLM-based Code Translation Benchmarking into the Class-level Era [20.22104136730419]
ClassEval-T is a class-level code translation benchmark for Large Language Models (LLMs)
Built upon ClassEval, ClassEval-T extends into Java and C++ with complete code samples and test suites, requiring 360 person-hours for manual migration.
arXiv Detail & Related papers (2024-11-09T11:13:14Z)
- Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs [21.06722050714324]
This paper focuses on automation of test oracles for clients of widely used Java libraries, e.g., java.lang and java.util packages.
We use large language models as an enabling technology to embody our insight into a framework for test oracle automation.
arXiv Detail & Related papers (2024-11-04T04:24:25Z)
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [24.386388107656334]
We propose PROVE, a framework that uses program-based verification to filter out potentially incorrect reasoning paths.
Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution.
PROVE consistently outperforms vanilla majority voting on mathematical reasoning tasks across all datasets and model sizes.
arXiv Detail & Related papers (2024-10-16T14:24:55Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- LiveBench: A Challenging, Contamination-Free LLM Benchmark [101.21578097087699]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources.
We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size.
Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models [85.73744378691727]
This study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs.
We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures.
arXiv Detail & Related papers (2024-01-12T15:21:36Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- A Language Model of Java Methods with Train/Test Deduplication [5.529795221640365]
This tool demonstration presents a research toolkit for a language model of Java source code.
The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java.
arXiv Detail & Related papers (2023-05-15T00:22:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.