Howzat? Appealing to Expert Judgement for Evaluating Human and AI Next-Step Hints for Novice Programmers
- URL: http://arxiv.org/abs/2411.18151v1
- Date: Wed, 27 Nov 2024 08:59:34 GMT
- Title: Howzat? Appealing to Expert Judgement for Evaluating Human and AI Next-Step Hints for Novice Programmers
- Authors: Neil C. C. Brown, Pierre Weill-Tessier, Juho Leinonen, Paul Denny, Michael Kölling
- Abstract summary: It is important to know what makes a good hint and how to generate good hints automatically in novice programming tools.
We recruited 44 Java educators from around the world to participate in an online study.
Participants ranked a set of candidate next-step Java hints, which were generated by Large Language Models (LLMs) and by five experienced human educators.
- Score: 3.2498303239935233
- Abstract: Motivation: Students learning to program often reach states where they are stuck and can make no forward progress. An automatically generated next-step hint can help them make forward progress and support their learning. It is important to know what makes a hint good or bad, and how to generate good hints automatically in novice programming tools, for example using Large Language Models (LLMs). Method and participants: We recruited 44 Java educators from around the world to participate in an online study. We used a set of real student code states as hint-generation scenarios. Participants used a technique known as comparative judgement to rank a set of candidate next-step Java hints, which were generated by LLMs and by five experienced human educators. Participants ranked the hints without being told how they were generated. Findings: We found considerable variation among LLMs in generating high-quality next-step hints for programming novices, with GPT-4 outperforming the other models tested. When used with a well-designed prompt, GPT-4 outperformed human experts in generating pedagogically valuable hints. A multi-stage prompt was the most effective LLM prompt. The two most important factors of a good hint were length (80-160 words being best) and reading level (US grade 9 or below being best). Offering alternative approaches to solving the problem was considered bad, and we found no effect of sentiment. Conclusions: Automatic generation of these hints is immediately viable, given that LLMs outperformed humans, even when the students' task is unknown. The fact that only the best prompts achieve this outcome suggests that students on their own are unlikely to produce the same benefit. The prompting task, therefore, should be embedded in an expert-designed tool.
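The abstract reports that a multi-stage prompt was the most effective LLM prompt, and that good hints were 80-160 words long, at a US grade 9 reading level or below, and did not offer alternative approaches. The sketch below illustrates what such a staged prompt could look like; the two stages, their wording, and the call_llm() helper are illustrative assumptions, not the paper's actual prompts.

```python
# Hedged sketch of a two-stage next-step-hint prompt. The stage wording
# and the call_llm() helper are assumptions, not the paper's prompts.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError("wire this to your LLM client of choice")

def next_step_hint(student_code: str) -> str:
    # Stage 1: ask the model to diagnose the code before hinting, so the
    # hint is grounded in a concrete observation, not generic advice.
    diagnosis = call_llm(
        "You are a Java educator. In one short paragraph, describe the "
        "most important problem in this novice student's code:\n"
        + student_code
    )
    # Stage 2: turn the diagnosis into a single next-step hint that fits
    # the quality factors the paper reports: 80-160 words, readable at
    # US grade 9 or below, and no alternative approaches.
    return call_llm(
        "Based on this diagnosis:\n" + diagnosis + "\n"
        "Write one next-step hint for the student in 80-160 words, at a "
        "US grade 9 reading level or below. Focus on the single next "
        "step; do not suggest alternative approaches."
    )
```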
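Comparative judgement, the paper's evaluation method, asks judges only for pairwise preferences between hints and derives a ranking from those preferences. One standard way to aggregate such judgements is a Bradley-Terry model; the minimal sketch below (an assumption, not the authors' analysis code) fits one with simple minorization-maximization updates and also flags each hint against the reported length and reading-level factors, using a crude syllable heuristic.

```python
# Minimal sketch: rank hints from pairwise comparative judgements with a
# Bradley-Terry model (a standard choice; the paper does not publish its
# aggregation code), then check the reported quality factors.
import re

def bradley_terry(n_items, judgements, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs
    using minorization-maximization updates."""
    strengths = [1.0] * n_items
    wins = [0] * n_items
    for winner, _ in judgements:
        wins[winner] += 1
    for _ in range(iters):
        new = []
        for i in range(n_items):
            denom = 0.0
            for a, b in judgements:
                if i in (a, b):
                    other = b if i == a else a
                    denom += 1.0 / (strengths[i] + strengths[other])
            new.append(wins[i] / denom if denom > 0 else strengths[i])
        total = sum(new)
        strengths = [s * n_items / total for s in new]  # keep scale stable
    return strengths

def fk_grade(text):
    """Flesch-Kincaid grade with a vowel-group syllable heuristic
    (an assumption; a dictionary-based tool would be more accurate)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    n = max(1, len(words))
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

def quality_flags(hint):
    n_words = len(hint.split())
    return {
        "length_ok": 80 <= n_words <= 160,    # best-rated length band
        "reading_ok": fk_grade(hint) <= 9.0,  # US grade 9 or below
    }

# Hypothetical example: three candidate hints, four pairwise judgements,
# each recorded as (index of preferred hint, index of the other hint).
hints = ["Look at the loop condition on line 3...",
         "Declare the counter before the loop...",
         "You could instead rewrite this with streams..."]
judgements = [(0, 1), (0, 2), (1, 2), (0, 1)]
for score, hint in sorted(zip(bradley_terry(len(hints), judgements), hints),
                          reverse=True):
    print(f"{score:.2f} {quality_flags(hint)} {hint}")
```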
Related papers
- One Step at a Time: Combining LLMs and Static Analysis to Generate Next-Step Hints for Programming Tasks [5.069252018619403]
Students often struggle to solve programming problems when learning to code, especially when they have to do it online.
One way to help is next-step hint generation: showing a student the specific small step they need to take next to reach a correct solution.
We propose a novel system to provide both textual and code hints for programming tasks.
arXiv Detail & Related papers (2024-10-11T21:41:57Z) - Show, Don't Tell: Aligning Language Models with Demonstrated Feedback [54.10302745921713]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Generating Feedback-Ladders for Logical Errors in Programming using Large Language Models [2.1485350418225244]
Large language model (LLM)-based methods have shown great promise in feedback generation for programming assignments.
This paper explores using LLMs to generate a "feedback-ladder", i.e., multiple levels of feedback for the same problem-submission pair.
We evaluate the quality of the generated feedback-ladder via a user study with students, educators, and researchers.
arXiv Detail & Related papers (2024-05-01T03:52:39Z) - Exploring How Multiple Levels of GPT-Generated Programming Hints Support or Disappoint Novices [0.0]
We investigated whether different levels of hints can support students' problem-solving and learning.
We conducted a think-aloud study with 12 novices using the LLM Hint Factory.
We discovered that high-level natural language hints alone can be unhelpful or even misleading.
arXiv Detail & Related papers (2024-04-02T18:05:26Z) - Next-Step Hint Generation for Introductory Programming Using Large Language Models [0.8002196839441036]
Large Language Models possess skills such as answering questions, writing essays, or solving programming exercises.
This work explores how LLMs can contribute to programming education by supporting students with automated next-step hints.
arXiv Detail & Related papers (2023-12-03T17:51:07Z) - Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or producing factually incorrect information.
This survey presents a comprehensive overview of technologies for aligning LLMs with humans.
arXiv Detail & Related papers (2023-07-24T17:44:58Z) - Large Language Models are Zero-Shot Rankers for Recommender Systems [76.02500186203929]
This work aims to investigate the capacity of large language models (LLMs) to act as the ranking model for recommender systems.
We show that LLMs have promising zero-shot ranking abilities but struggle to perceive the order of historical interactions.
We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies.
arXiv Detail & Related papers (2023-05-15T17:57:39Z) - Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z) - Large Language Models Are Human-Level Prompt Engineers [31.98042013940282]
We propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection.
We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness.
arXiv Detail & Related papers (2022-11-03T15:43:03Z) - RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning [84.75064077323098]
This paper proposes RLPrompt, an efficient discrete prompt optimization approach based on reinforcement learning (RL).
RLPrompt is flexibly applicable to different types of LMs, such as masked models (e.g., BERT) and left-to-right models (e.g., GPTs).
Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods.
arXiv Detail & Related papers (2022-05-25T07:50:31Z) - Few-Shot Bot: Prompt-Based Learning for Dialogue Systems [58.27337673451943]
Learning to converse using only a few examples is a great challenge in conversational AI.
The current best conversational models are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL).
We propose prompt-based few-shot learning which does not require gradient-based fine-tuning but instead uses a few examples as the only source of learning.
arXiv Detail & Related papers (2021-10-15T14:36:45Z)