OOP: Object-Oriented Programming Evaluation Benchmark for Large Language
Models
- URL: http://arxiv.org/abs/2401.06628v2
- Date: Wed, 21 Feb 2024 06:18:16 GMT
- Title: OOP: Object-Oriented Programming Evaluation Benchmark for Large Language
Models
- Authors: Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, Dacheng Tao
- Abstract summary: This study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs.
We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures.
- Score: 85.73744378691727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancing automated programming necessitates robust and comprehensive code
generation benchmarks, yet current evaluation frameworks largely neglect
object-oriented programming (OOP) in favor of functional programming (FP),
e.g., HumanEval and MBPP. To address this, our study introduces a pioneering
OOP-focused benchmark, featuring 431 Python programs that encompass essential
OOP concepts and features like classes and encapsulation methods. We propose a
novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k
measures. Our evaluation of 23 leading large language models (LLMs), including
both general and code-specialized models, reveals three key insights: 1) pass@o
offers a more relevant and comprehensive assessment for OOP code generation; 2)
Despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP
compared to models like ChatGPT; 3) The poor performance of all advanced LLMs
on our OOP benchmark highlights a critical need for improvements in this field.
Our benchmark and scripts are publicly released at:
https://github.com/alphadl/OOP-eval.
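For context, pass@o refines the conventional pass@k metric. The sketch below shows the standard unbiased per-problem pass@k estimator that such metrics are typically computed from; it is a minimal illustration only, and the exact definition of pass@o is given in the paper and the released scripts at the repository above.
```python
# Minimal sketch of the standard unbiased pass@k estimator (per problem),
# shown for context; pass@o's exact OOP-specific definition is in the paper.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled solutions, c of which pass."""
    if n - c < k:  # every size-k sample is guaranteed to contain a passing solution
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 20 sampled completions pass the problem's test cases.
print(round(pass_at_k(n=20, c=3, k=1), 3))   # 0.15
print(round(pass_at_k(n=20, c=3, k=10), 3))  # 0.895
```
The benchmark-level score is then the mean of this per-problem estimate over all programs in the benchmark.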
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation [5.6001617185032595]
Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks.
We fine-tune open-source large language models (LLMs) in a parameter-efficient, quantized low-rank fashion on consumer-grade hardware to improve review comment generation.
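As a rough, hypothetical illustration of what parameter-efficient, quantized low-rank fine-tuning can look like, the sketch below loads a base model in 4-bit and attaches LoRA adapters with Hugging Face transformers and peft; the base model name, rank, and target modules are assumptions for illustration and not the configuration used in that paper.
```python
# Illustrative QLoRA-style setup (assumed configuration, not the paper's):
# load a base model in 4-bit and train only small low-rank adapter matrices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "codellama/CodeLlama-7b-hf"  # hypothetical base model

bnb_config = BitsAndBytesConfig(        # 4-bit quantization for consumer GPUs
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(               # low-rank adapters: the only trained weights
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # typically well under 1% of all parameters
```
Training then proceeds with a standard causal-LM objective on review-comment data.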
arXiv Detail & Related papers (2024-11-15T12:01:38Z)
- Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming [17.62426370778165]
Object-Oriented Programming (OOP) has become a crucial paradigm for managing the growing complexity of modern software systems.
This work provides a comprehensive introduction to the integration of OOP techniques within machine learning and big data analytics.
We examine how design patterns and modular programming can be employed to enhance the structure and efficiency of machine learning systems.
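As a hypothetical illustration of that point about design patterns and modular programming, the sketch below applies a strategy-style pattern so an ML pipeline depends only on an abstract interface and concrete preprocessing strategies can be swapped freely; all names are illustrative and not taken from that paper.
```python
# Strategy-pattern sketch: interchangeable preprocessing steps behind one interface.
from abc import ABC, abstractmethod
from typing import List

class Scaler(ABC):
    """Interface every concrete scaling strategy must implement."""
    @abstractmethod
    def transform(self, values: List[float]) -> List[float]: ...

class MinMaxScaler(Scaler):
    def transform(self, values: List[float]) -> List[float]:
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0
        return [(v - lo) / span for v in values]

class ZScoreScaler(Scaler):
    def transform(self, values: List[float]) -> List[float]:
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
        return [(v - mean) / std for v in values]

class Pipeline:
    """Depends only on the Scaler interface, so strategies swap without edits here."""
    def __init__(self, scaler: Scaler):
        self._scaler = scaler  # encapsulated collaborator

    def run(self, values: List[float]) -> List[float]:
        return self._scaler.transform(values)

print(Pipeline(MinMaxScaler()).run([1.0, 2.0, 3.0]))  # [0.0, 0.5, 1.0]
```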
arXiv Detail & Related papers (2024-09-30T03:37:10Z)
- LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4 and Bard's Capacity to Handle Object-Oriented Programming Assignments [0.0]
Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments.
In this study, we experimented with three prominent LLMs to solve real-world OOP exercises used in educational settings.
The findings revealed that while the models frequently achieved mostly working solutions to the exercises, they often overlooked the best practices of OOP.
arXiv Detail & Related papers (2024-03-10T16:40:05Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs [1.9207412600219353]
We evaluate two popular benchmarks for Python code generation, analyzing their diversity and difficulty.
Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely.
We propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts on a balanced representation of 38 programming concepts.
arXiv Detail & Related papers (2024-01-08T12:36:43Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving approximate accuracies of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z)
- OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization [101.37439352091612]
We describe the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes.
We present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT.
arXiv Detail & Related papers (2022-12-22T19:56:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.