Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench
- URL: http://arxiv.org/abs/2406.12902v1
- Date: Mon, 10 Jun 2024 06:43:25 GMT
- Title: Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench
- Authors: Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu
- Abstract summary: We propose JavaBench, a project-level Java benchmark that exercises OOP features.
It comprises four Java projects with 389 methods in 106 Java classes.
Its quality is attested by 282 undergraduate students, who reached a 90.93/100 average score against its test suite.
- Score: 22.95865189208591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming languages: 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java. Second, imbalanced code granularity: function-/statement-level benchmarks account for over 83.3% of benchmarks, only a handful extend to class-/project-level, and all of those are limited to Python. Third, lacking advanced features: existing benchmarks primarily assess basic coding skills, while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLMs' capabilities against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiments yield several interesting findings. First, regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLM, and at most 41.17% Pass@5 in a more relaxed evaluation). Second, using method signatures as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench.
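The abstract reports Pass@5 figures without restating how Pass@k is computed. For reference, the sketch below shows the standard unbiased Pass@k estimator from the HumanEval/Codex line of work, 1 - C(n-c, k) / C(n, k) averaged over problems, which the JavaBench numbers presumably follow; the class and method names are illustrative and not taken from the JavaBench repository.

```java
// Minimal sketch (not from the JavaBench repository): the standard unbiased
// Pass@k estimator, 1 - C(n-c, k) / C(n, k), evaluated per problem.
public final class PassAtKEstimator {

    // n = samples generated per problem, c = samples that pass the test suite,
    // k = budget; returns the probability that at least one of k drawn samples passes.
    static double passAtK(int n, int c, int k) {
        if (n - c < k) {
            return 1.0; // every size-k subset must contain a passing sample
        }
        double failRatio = 1.0; // C(n-c, k) / C(n, k), computed incrementally to avoid overflow
        for (int i = n - c + 1; i <= n; i++) {
            failRatio *= 1.0 - (double) k / i;
        }
        return 1.0 - failRatio;
    }

    public static void main(String[] args) {
        // Toy example: 10 samples for one problem, 2 of which pass -> Pass@5 ~= 0.78
        System.out.printf("Pass@5 = %.4f%n", passAtK(10, 2, 5));
    }
}
```

Benchmark-level Pass@k is then typically the mean of this per-problem estimate; how JavaBench aggregates it across its two granularities and three hierarchical metrics is detailed in the paper itself.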
Related papers
- LiveBench: A Challenging, Contamination-Free LLM Benchmark [101.21578097087699]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources.
We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size.
Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts (for example, different languages and dialects) are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z) - On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions [12.245958803682505]
Code completion is a key feature of Integrated Development Environments (IDEs).
Modern code completion approaches are often powered by deep learning (DL) models.
Can these models generalize across different language versions?
arXiv Detail & Related papers (2024-03-22T12:05:18Z) - BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies [8.746971239693066]
For the first time, Python outperformed Java in the Stack Overflow developer survey.
This popularity stands in stark contrast with the abundance of testing and debugging tools designed for Java and the relative scarcity of such tools for Python.
In this project, we create a benchmark database and tool that contain 493 real bugs from 17 real-world Python programs.
arXiv Detail & Related papers (2024-01-27T19:07:34Z) - OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models [85.73744378691727]
This study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs.
We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures.
arXiv Detail & Related papers (2024-01-12T15:21:36Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - A Language Model of Java Methods with Train/Test Deduplication [5.529795221640365]
This tool demonstration presents a research toolkit for a language model of Java source code.
The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java.
arXiv Detail & Related papers (2023-05-15T00:22:02Z) - DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable): across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
arXiv Detail & Related papers (2022-11-18T17:20:27Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.