Generating Planning Feedback for Open-Ended Programming Exercises with LLMs
- URL: http://arxiv.org/abs/2504.08958v1
- Date: Fri, 11 Apr 2025 20:26:49 GMT
- Title: Generating Planning Feedback for Open-Ended Programming Exercises with LLMs
- Authors: Mehmet Arif Demirtaş, Claire Zheng, Max Fowler, Kathryn Cunningham
- Abstract summary: Large language models (LLMs) may be able to generate feedback by detecting the overall code structure even for submissions with syntax errors.
We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy.
LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps.
- Score: 1.2499537119440245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLMs) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e. programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.
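The abstract does not include the authors' prompts or plan definitions, so the following is only a minimal sketch of what LLM-based plan detection could look like in an autograder. The OpenAI Python SDK, the gpt-4o-mini model name, the example plan catalog, and the JSON output format are all assumptions made for illustration, not details taken from the paper.

```python
# A minimal sketch of LLM-based plan detection, NOT the authors' implementation.
# Assumptions (not from the paper): the OpenAI Python SDK, the "gpt-4o-mini"
# model name, the example plan names, and the JSON output format.
import json
from openai import OpenAI

# Hypothetical plan catalog for one exercise; the paper's actual plan
# definitions are not specified in the abstract.
PLANS = [
    "read input values in a loop",
    "accumulate a running total",
    "track the maximum value seen so far",
    "print a formatted summary",
]

def detect_plans(student_code: str, plans: list[str] = PLANS) -> dict[str, bool]:
    """Ask the model which high-level plans appear in the (possibly
    syntactically broken) student submission."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "A student submitted the program below. It may contain syntax errors.\n"
        "For each plan, answer true if the student attempted it, else false.\n"
        "Reply with a JSON object mapping each plan to true or false.\n\n"
        f"Plans: {json.dumps(plans)}\n\nStudent program:\n{student_code}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    # Submission with a syntax error (missing colon) that a test-case
    # autograder could not run, but whose plans are still recognizable.
    submission = "total = 0\nfor x in data\n    total += x\nprint(total)"
    print(detect_plans(submission))
```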
Related papers
- Self-Steering Language Models [113.96916935955842]
DisCIPL is a method for "self-steering" language models.
DisCIPL uses a Planner model to generate a task-specific inference program.
Our work opens up a design space of highly-parallelized Monte Carlo inference strategies.
arXiv Detail & Related papers (2025-04-09T17:54:22Z) - Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving [0.0]
Large Language Models (LLMs) have emerged as potential tools to automate feedback generation.
This study evaluates the performance of four LLMs on a benchmark dataset of 45 student solutions.
arXiv Detail & Related papers (2025-03-18T18:31:36Z) - Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant [0.0]
This article focuses on studying three aspects related to such an application.
The performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students is evaluated.
arXiv Detail & Related papers (2025-01-24T08:15:05Z) - Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z) - ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning [54.70811660561151]
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means for automated evaluation of whether a model can consistently produce correct final answers across various inputs to the program.
We observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
arXiv Detail & Related papers (2024-10-24T18:02:37Z) - A Picture Is Worth a Thousand Words: Exploring Diagram and Video-Based OOP Exercises to Counter LLM Over-Reliance [2.1490831374964587]
Large language models (LLMs) can effectively solve a range of more complex object-oriented programming (OOP) exercises with text-based specifications.
This raises concerns about academic integrity, as students might use these models to complete assignments unethically.
We propose an innovative approach to formulating OOP tasks using diagrams and videos, as a way to foster problem-solving and deter students from a copy-and-prompt approach in OOP courses.
arXiv Detail & Related papers (2024-03-13T10:21:29Z) - Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z) - Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [75.74103236299477]
Chain-of-thought prompting (CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach that can deliberate over the reasoning steps with tool interfaces, namely DELI.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z) - Generalized Planning in PDDL Domains with Pretrained Large Language Models [82.24479434984426]
We consider PDDL domains and use GPT-4 to synthesize Python programs.
We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines.
arXiv Detail & Related papers (2023-05-18T14:48:20Z) - SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer (a minimal sketch of this declarative-solving idea follows this entry).
We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
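As a toy illustration of the declarative, solver-aided idea summarized in the SatLM entry above (not the authors' code), the sketch below hand-writes the kind of declarative specification an LLM would be asked to generate and solves it with the z3-solver package; the word problem, constraint set, and variable names are invented for the example.

```python
# Toy illustration of declarative, solver-aided answering (SatLM-style idea),
# not the SatLM authors' code. Assumes the z3-solver package; in SatLM the
# declarative specification below would be generated by an LLM from the
# problem text rather than written by hand.
from z3 import Int, Solver, sat

# Word problem: "Alice and Bob have 30 apples together; Alice has 6 more
# than Bob. How many apples does Alice have?"
alice, bob = Int("alice"), Int("bob")

solver = Solver()
solver.add(alice + bob == 30)  # declarative constraints only, no solution steps
solver.add(alice == bob + 6)

if solver.check() == sat:
    model = solver.model()
    print("Alice has", model[alice], "apples")  # -> Alice has 18 apples
else:
    print("No consistent answer found")
```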