Bridging Gaps Between Student and Expert Evaluations of AI-Generated Programming Hints
- URL: http://arxiv.org/abs/2509.03269v1
- Date: Wed, 03 Sep 2025 12:38:35 GMT
- Title: Bridging Gaps Between Student and Expert Evaluations of AI-Generated Programming Hints
- Authors: Tung Phung, Mengyan Wu, Heeryung Choi, Gustavo Soares, Sumit Gulwani, Adish Singla, Christopher Brooks
- Abstract summary: We study mismatches in perceived hint quality from students' and experts' perspectives. We propose and discuss preliminary results on potential methods to bridge these gaps.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative AI has the potential to enhance education by providing personalized feedback to students at scale. Recent work has proposed techniques to improve AI-generated programming hints and has evaluated their performance based on expert-designed rubrics or student ratings. However, it remains unclear how the rubrics used to design these techniques align with students' perceived helpfulness of hints. In this paper, we systematically study the mismatches in perceived hint quality from students' and experts' perspectives based on the deployment of AI-generated hints in a Python programming course. We analyze scenarios with discrepancies between student and expert evaluations, in particular, where experts rated a hint as high-quality while the student found it unhelpful. We identify key reasons for these discrepancies and classify them into categories, such as hints not accounting for the student's main concern or not considering previous help requests. Finally, we propose and discuss preliminary results on potential methods to bridge these gaps, first by extending the expert-designed quality rubric and then by adapting the hint generation process, e.g., incorporating the student's comments or history. These efforts contribute toward scalable, personalized, and pedagogically sound AI-assisted feedback systems, which are particularly important for high-enrollment educational settings.
Related papers
- ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
ScholarPeer is a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. We evaluate ScholarPeer on DeepReview-13K, and the results demonstrate that ScholarPeer achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations.
arXiv Detail & Related papers (2026-01-30T06:54:55Z)
- A Survey on Feedback Types in Automated Programming Assessment Systems
This study investigates how different feedback mechanisms in APASs are perceived by students and how effective they are in supporting problem-solving. Results indicate that while students rate unit test feedback as the most helpful, AI-generated feedback leads to significantly better performance.
arXiv Detail & Related papers (2025-10-21T09:08:22Z)
- CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews. We also develop CoCoDet, an AI review detector via a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z)
- Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education
This article presents a scalable, AI-supported framework for qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments. We report on its successful deployment across a large college of engineering.
arXiv Detail & Related papers (2025-08-01T20:27:40Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- From Coders to Critics: Empowering Students through Peer Assessment in the Age of AI Copilots
This paper presents an empirical study of a rubric-based, anonymized peer review process implemented in a large programming course. Students evaluated each other's final projects (a 2D game), and their assessments were compared to instructor grades using correlation, mean absolute error (MAE), and root mean square error (RMSE). Results show that peer review can approximate instructor evaluation with moderate accuracy and foster student engagement, evaluative thinking, and interest in providing good feedback to peers.
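The agreement metrics mentioned in this entry can be sketched in a few lines. This is a minimal illustration only: the scores below are made up, and the function name and 0-100 scale are our assumptions, not details from the paper.

```python
import math

def compare_scores(peer, instructor):
    """Compare peer-review scores to instructor grades using
    Pearson correlation, MAE, and RMSE. Illustrative sketch;
    not the paper's actual analysis pipeline."""
    n = len(peer)
    mean_p = sum(peer) / n
    mean_i = sum(instructor) / n
    # Pearson correlation: covariance over the product of std. deviations.
    cov = sum((p - mean_p) * (g - mean_i) for p, g in zip(peer, instructor))
    var_p = sum((p - mean_p) ** 2 for p in peer)
    var_i = sum((g - mean_i) ** 2 for g in instructor)
    corr = cov / math.sqrt(var_p * var_i)
    # Mean absolute error and root mean square error of peer vs. instructor.
    mae = sum(abs(p - g) for p, g in zip(peer, instructor)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(peer, instructor)) / n)
    return corr, mae, rmse

# Hypothetical scores on a 0-100 scale.
peer = [85, 70, 92, 60, 78]
instructor = [80, 75, 90, 65, 80]
corr, mae, rmse = compare_scores(peer, instructor)
```

A high correlation with a low MAE/RMSE would indicate that peer scores track instructor grades closely, which is the kind of evidence the study reports.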
arXiv Detail & Related papers (2025-05-28T08:17:05Z)
- Level Up Peer Review in Education: Investigating genAI-driven Gamification system and its influence on Peer Feedback Effectiveness
This paper introduces Socratique, a gamified peer-assessment platform integrated with Generative AI (GenAI) assistance. By incorporating game elements, Socratique aims to motivate students to provide more feedback. Students in the treatment group provided significantly more voluntary feedback, with higher scores on clarity, relevance, and specificity.
arXiv Detail & Related papers (2025-04-03T18:30:25Z)
- Beyond Detection: Designing AI-Resilient Assessments with Automated Feedback Tool to Foster Critical Thinking
This research proposes a proactive, AI-resilient solution based on assessment design rather than detection. It introduces a web-based Python tool that integrates Bloom's taxonomy with advanced natural language processing techniques. It helps educators determine whether a task targets lower-order thinking, such as recall and summarization, or higher-order skills, such as analysis, evaluation, and creation.
arXiv Detail & Related papers (2025-03-30T23:13:00Z)
- The Superalignment of Superhuman Intelligence with Large Language Models
We discuss the concept of superalignment from the learning perspective to answer this question. We highlight some key research problems in superalignment, namely weak-to-strong generalization, scalable oversight, and evaluation. We present a conceptual framework for superalignment consisting of three modules: an attacker that generates adversarial queries to expose the weaknesses of a learner model; a learner that refines itself by learning from scalable feedback generated by a critic model together with minimal human expert input; and a critic that generates critiques or explanations for a given query-response pair, with the goal of improving the learner.
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions. GPT-4 answers an average of 65.8% of questions correctly and can even produce the correct answer across at least one prompting strategy for 85.1% of questions. Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- Identifying Student Profiles Within Online Judge Systems Using Explainable Artificial Intelligence
Online Judge (OJ) systems are typically considered within programming-related courses, as they yield fast and objective assessments of the code developed by the students. This work aims to tackle this limitation by further exploiting the information gathered by the OJ and automatically inferring feedback for both the student and the instructor.
arXiv Detail & Related papers (2024-01-29T12:11:30Z)
- Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach
This paper presents an approach to deriving a learner model directly from an assessment rubric. We illustrate how the approach can be applied to automate the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z)
- ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback
In this paper, we frame the problem of providing feedback as few-shot classification. A meta-learner adapts to give feedback on student code for a new programming question from just a few instructor-provided examples. Our approach was successfully deployed to deliver feedback on 16,000 student exam solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z)
- Leveraging Expert Consistency to Improve Algorithmic Decision Support
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap. We propose an influence-function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert. Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.