LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4
and Bard's Capacity to Handle Object-Oriented Programming Assignments
- URL: http://arxiv.org/abs/2403.06254v1
- Date: Sun, 10 Mar 2024 16:40:05 GMT
- Title: LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4
and Bard's Capacity to Handle Object-Oriented Programming Assignments
- Authors: Bruno Pereira Cipriano, Pedro Alves
- Abstract summary: Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments.
In this study, we experimented with three prominent LLMs to solve real-world OOP exercises used in educational settings.
The findings revealed that while the models frequently achieved mostly working solutions to the exercises, they often overlooked the best practices of OOP.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) have emerged as promising tools to assist
students while solving programming assignments. However, object-oriented
programming (OOP), with its inherent complexity involving the identification of
entities, relationships, and responsibilities, is not yet mastered by these
tools. In contrast to introductory programming exercises, there exists a research
gap regarding the behavior of LLMs in OOP contexts. In this study, we
experimented with three prominent LLMs - GPT-3.5, GPT-4, and Bard - to solve
real-world OOP exercises used in educational settings, subsequently validating
their solutions using an Automatic Assessment Tool (AAT). The findings revealed
that while the models frequently achieved mostly working solutions to the
exercises, they often overlooked the best practices of OOP. GPT-4 stood out as
the most proficient, followed by GPT-3.5, with Bard trailing last. We advocate
for a renewed emphasis on code quality when employing these models and explore
the potential of pairing LLMs with AATs in pedagogical settings. In conclusion,
while GPT-4 showcases promise, the deployment of these models in OOP education
still mandates supervision.
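The title alludes to a recurring weakness in the generated solutions: type dispatch via instanceof where a polymorphic design is expected. The Java sketch below is a hypothetical illustration of that contrast, not an exercise or solution from the paper; the shape classes and method names are invented for this example.

```java
// Hypothetical example; not taken from the paper's exercise set.
public class InstanceofVsPolymorphism {

    // Discouraged style (the kind of instanceof usage the title alludes to):
    // the caller branches on concrete types, so every new subclass forces
    // another check here.
    static double areaWithInstanceof(Object s) {
        if (s instanceof Circle c) {
            return Math.PI * c.radius() * c.radius();
        } else if (s instanceof Rectangle r) {
            return r.width() * r.height();
        }
        throw new IllegalArgumentException("Unknown shape: " + s);
    }

    // OOP best practice: each class carries its own responsibility,
    // and callers rely on dynamic dispatch instead of type checks.
    interface Shape {
        double area();
    }

    record Circle(double radius) implements Shape {
        public double area() { return Math.PI * radius * radius; }
    }

    record Rectangle(double width, double height) implements Shape {
        public double area() { return width * height; }
    }

    public static void main(String[] args) {
        Shape c = new Circle(2.0);
        System.out.println(areaWithInstanceof(c)); // works, but scales poorly
        System.out.println(c.area());              // polymorphic call, no casts
    }
}
```

Both styles can pass purely functional tests, which is consistent with the abstract's point that "mostly working" solutions may still overlook OOP best practices.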
Related papers
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses (2024-08-16)
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
- Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions (2024-06-20)
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math.
By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model.
- Can large language models explore in-context? (2024-03-22)
We deploy Large Language Models as agents in simple multi-armed bandit environments.
We find that the models do not robustly engage in exploration without substantial interventions.
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error (2024-03-07)
For tool use, existing large language models (LLMs) only reach a correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE).
STE orchestrates three key mechanisms for successful tool-use behavior in biological systems: trial and error, imagination, and memory.
- Feedback-Generation for Programming Exercises With GPT-4 (2024-03-07)
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
- OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models (2024-01-12)
This study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs.
We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures (the standard pass@k estimator is recalled after this list).
- GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond (2023-09-28)
GPT-Fathom is an open-source and reproducible evaluation suite for large language models (LLMs) built on top of OpenAI Evals.
We evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (2023-09-16)
Struc-Bench is a comprehensive benchmark for complex structured-data generation, evaluated with prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction (2023-05-30)
GPT4Tools is based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools.
It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts.
- Generalized Planning in PDDL Domains with Pretrained Large Language Models (2023-05-18)
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
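For context on the pass@o mention in the OOP benchmark entry above: pass@o itself is defined in that paper and is not reproduced here, but the traditional pass@k estimator it is described as extending is the standard one from Chen et al. (2021), recalled below, where n is the number of sampled solutions per problem and c the number that pass the tests.

```latex
% Unbiased pass@k estimator: sample n candidate programs per problem,
% count the c that pass all tests, and average over problems.
\[
  \operatorname{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]
```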
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this automatically generated information and is not responsible for any consequences of its use.