LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4
and Bard's Capacity to Handle Object-Oriented Programming Assignments
- URL: http://arxiv.org/abs/2403.06254v1
- Date: Sun, 10 Mar 2024 16:40:05 GMT
- Title: LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4
and Bard's Capacity to Handle Object-Oriented Programming Assignments
- Authors: Bruno Pereira Cipriano, Pedro Alves
- Abstract summary: Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments.
In this study, we experimented with three prominent LLMs to solve real-world OOP exercises used in educational settings.
The findings revealed that while the models frequently achieved mostly working solutions to the exercises, they often overlooked the best practices of OOP.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) have emerged as promising tools to assist
students while solving programming assignments. However, object-oriented
programming (OOP), with its inherent complexity involving the identification of
entities, relationships, and responsibilities, is not yet mastered by these
tools. In contrast to introductory programming exercises, there exists a research
gap regarding the behavior of LLMs in OOP contexts. In this study, we
experimented with three prominent LLMs - GPT-3.5, GPT-4, and Bard - to solve
real-world OOP exercises used in educational settings, subsequently validating
their solutions using an Automatic Assessment Tool (AAT). The findings revealed
that while the models frequently achieved mostly working solutions to the
exercises, they often overlooked the best practices of OOP. GPT-4 stood out as
the most proficient, followed by GPT-3.5, with Bard trailing last. We advocate
for a renewed emphasis on code quality when employing these models and explore
the potential of pairing LLMs with AATs in pedagogical settings. In conclusion,
while GPT-4 showcases promise, the deployment of these models in OOP education
still mandates supervision.
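The title alludes to a recurring weakness in the generated solutions: type dispatch via instanceof where a polymorphic design is expected. The Java sketch below is a hypothetical illustration of that contrast, not an exercise or solution from the paper; the shape classes and method names are invented for this example.

```java
// Hypothetical example; not taken from the paper's exercise set.
public class InstanceofVsPolymorphism {

    // Discouraged style (the kind of instanceof usage the title alludes to):
    // the caller branches on concrete types, so every new subclass forces
    // another check here.
    static double areaWithInstanceof(Object s) {
        if (s instanceof Circle c) {
            return Math.PI * c.radius() * c.radius();
        } else if (s instanceof Rectangle r) {
            return r.width() * r.height();
        }
        throw new IllegalArgumentException("Unknown shape: " + s);
    }

    // OOP best practice: each class carries its own responsibility,
    // and callers rely on dynamic dispatch instead of type checks.
    interface Shape {
        double area();
    }

    record Circle(double radius) implements Shape {
        public double area() { return Math.PI * radius * radius; }
    }

    record Rectangle(double width, double height) implements Shape {
        public double area() { return width * height; }
    }

    public static void main(String[] args) {
        Shape c = new Circle(2.0);
        System.out.println(areaWithInstanceof(c)); // works, but scales poorly
        System.out.println(c.area());              // polymorphic call, no casts
    }
}
```

Both styles can pass purely functional tests, which is consistent with the abstract's point that "mostly working" solutions may still overlook OOP best practices.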
Related papers
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses (2024-08-16)
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
- Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions (2024-06-20)
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math.
By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model.
- Can large language models explore in-context? (2024-03-22)
We deploy Large Language Models as agents in simple multi-armed bandit environments.
We find that the models do not robustly engage in exploration without substantial interventions.
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error (2024-03-07)
For tool use, existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE).
STE orchestrates three key mechanisms for successful tool-use behavior in biological systems: trial and error, imagination, and memory.
- Feedback-Generation for Programming Exercises With GPT-4 (2024-03-07)
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
- OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models (2024-01-12)
This study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs.
We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures (the standard pass@k estimator is recalled after this list).
- GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond (2023-09-28)
GPT-Fathom is an open-source and reproducible evaluation suite for large language models (LLMs) built on top of OpenAI Evals.
We evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (2023-09-16)
Struc-Bench is a comprehensive benchmark for complex structured-data generation, evaluated with prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction (2023-05-30)
GPT4Tools is based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools.
It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts.
- Generalized Planning in PDDL Domains with Pretrained Large Language Models (2023-05-18)
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
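For context on the pass@o mention in the OOP benchmark entry above: pass@o itself is defined in that paper and is not reproduced here, but the traditional pass@k estimator it is described as extending is the standard one from Chen et al. (2021), recalled below, where n is the number of sampled solutions per problem and c the number that pass the tests.

```latex
% Unbiased pass@k estimator: sample n candidate programs per problem,
% count the c that pass all tests, and average over problems.
\[
  \operatorname{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]
```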
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this automatically generated information and is not responsible for any consequences of its use.