StepGrade: Grading Programming Assignments with Context-Aware LLMs
- URL: http://arxiv.org/abs/2503.20851v1
- Date: Wed, 26 Mar 2025 17:36:26 GMT
- Title: StepGrade: Grading Programming Assignments with Context-Aware LLMs
- Authors: Mohammad Akyash, Kimia Zamiri Azar, Hadi Mardani Kamali
- Abstract summary: This study introduces StepGrade, which explores the use of Chain-of-Thought (CoT) prompting with Large Language Models (LLMs). Unlike regular prompting, which offers limited and surface-level outputs, CoT prompting allows the model to reason step-by-step through the interconnected grading criteria. To empirically validate the efficiency of StepGrade, we conducted a case study involving 30 Python programming assignments across three difficulty levels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grading programming assignments is a labor-intensive and time-consuming process that demands careful evaluation across multiple dimensions of the code. To overcome these challenges, automated grading systems are leveraged to enhance efficiency and reduce the workload on educators. Traditional automated grading systems often focus solely on correctness, failing to provide interpretable evaluations or actionable feedback for students. This study introduces StepGrade, which explores the use of Chain-of-Thought (CoT) prompting with Large Language Models (LLMs) as an innovative solution to address these challenges. Unlike regular prompting, which offers limited and surface-level outputs, CoT prompting allows the model to reason step-by-step through the interconnected grading criteria, i.e., functionality, code quality, and algorithmic efficiency, ensuring a more comprehensive and transparent evaluation. This interconnectedness necessitates the use of CoT to systematically address each criterion while considering their mutual influence. To empirically validate the efficiency of StepGrade, we conducted a case study involving 30 Python programming assignments across three difficulty levels (easy, intermediate, and advanced). The approach is validated against expert human evaluations to assess its consistency, accuracy, and fairness. Results demonstrate that CoT prompting significantly outperforms regular prompting in both grading quality and interpretability. By reducing the time and effort required for manual grading, this research demonstrates the potential of GPT-4 with CoT prompting to revolutionize programming education through scalable and pedagogically effective automated grading systems.
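StepGrade's actual prompts are not reproduced in this listing; the following minimal sketch only illustrates the idea of CoT grading across the three interconnected criteria, assuming the OpenAI Python SDK (openai>=1.0). The rubric wording and score format are illustrative assumptions, not the paper's prompt.

```python
# Minimal sketch of CoT-style grading, assuming the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment. The rubric wording and the
# SCORES line format are illustrative, not StepGrade's actual prompt.
from openai import OpenAI

client = OpenAI()

COT_RUBRIC = """You are grading a Python assignment. Reason step by step:
1. Functionality: does the code meet the stated requirements? Note failures.
2. Code quality: naming, structure, documentation. Cite specific lines.
3. Algorithmic efficiency: time/space complexity and avoidable waste.
These criteria interact (e.g. a quality refactor can change efficiency),
so revisit earlier steps if a later one contradicts them.
Finish with one line: SCORES: functionality=<0-10> quality=<0-10> efficiency=<0-10>"""

def grade(assignment_spec: str, student_code: str) -> str:
    """Return the model's step-by-step evaluation ending in a SCORES line."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # favor consistency across repeated gradings
        messages=[
            {"role": "system", "content": COT_RUBRIC},
            {"role": "user",
             "content": f"Assignment:\n{assignment_spec}\n\nSubmission:\n{student_code}"},
        ],
    )
    return response.choices[0].message.content
```

In practice one would parse the SCORES line into numbers and spot-check a sample against human graders, as the paper's case study does.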
Related papers
- Probing the Unknown: Exploring Student Interactions with Probeable Problems at Scale in Introductory Programming [4.1153199495993364]
This study explores the use of "Probeable Problems", automatically gradable tasks that have deliberately vague or incomplete specifications.
Such problems require students to submit test inputs, or 'probes', to clarify requirements before implementation (a toy probe loop is sketched after this entry).
Systematic strategies, such as thoroughly exploring expected behavior before coding, resulted in fewer incorrect code submissions and correlated with course success.
arXiv Detail & Related papers (2025-04-16T02:50:00Z)
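The paper's grading infrastructure is not shown in this summary; the toy sketch below only illustrates the probe idea against a hypothetical underspecified task (all names here are invented).

```python
# Toy illustration of a probeable problem. The task spec is deliberately
# vague ("return the largest value"), and the hidden oracle encodes rules
# students must discover by probing before they implement anything.
def _reference(xs: list[int]) -> int:
    # Hidden reference solution, never shown to students.
    # Unstated rule: an empty list yields 0 rather than an error.
    return max(xs, default=0)

def probe(xs: list[int]) -> int:
    """Students call this with test inputs ('probes') to clarify the spec."""
    return _reference(xs)

# A systematic student probes edge cases before writing any code:
print(probe([3, 1, 2]))   # 3  -- the ordinary case
print(probe([]))          # 0  -- reveals the unstated empty-list rule
print(probe([-5, -2]))    # -2 -- confirms negatives behave normally
```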
- Synthesizing High-Quality Programming Tasks with LLM-based Expert and Student Agents [26.884829816265174]
PyTaskSyn is a novel synthesis technique that first generates a programming task and then decides whether it meets certain quality criteria before it is given to students.
We show that PyTaskSyn significantly improves task quality compared to baseline techniques and showcases the importance of each specialized agent type in our validation pipeline.
arXiv Detail & Related papers (2025-04-10T11:08:39Z)
- CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring [2.249916681499244]
Methods such as chain-of-thought (CoT) prompting enable teachers to grade formative assessments in science.
The extent to which these methods generalize across curricula in multiple domains remains largely untested.
We introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring.
arXiv Detail & Related papers (2025-04-03T06:53:34Z)
- Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique [66.94905631175209]
We propose PANEL, a novel inference-time scaling approach based on stepwise natural language self-critique.
It employs self-generated natural language critiques as feedback to guide the step-level search process (a minimal version of this loop is sketched after this entry).
This approach bypasses the need for task-specific verifiers and the associated training overhead.
arXiv Detail & Related papers (2025-03-21T17:59:55Z)
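PANEL's actual search procedure is not reproduced here; the sketch below is a minimal generate-critique-revise loop in its spirit, with `llm` standing in for any text-generation call and illustrative prompt wording.

```python
# Minimal sketch of stepwise natural language self-critique, in the
# spirit of PANEL. `llm` is a placeholder for any text-generation call;
# the prompt wording is illustrative, not the paper's.
from typing import Callable

def solve_with_self_critique(problem: str, llm: Callable[[str], str],
                             max_steps: int = 8, max_revisions: int = 2) -> list[str]:
    """Build a reasoning chain step by step, critiquing each step before keeping it."""
    steps: list[str] = []
    for _ in range(max_steps):
        context = f"Problem: {problem}\nSteps so far:\n" + "\n".join(steps)
        step = llm(context + "\nPropose the next reasoning step, or reply DONE.")
        if step.strip().upper() == "DONE":
            break
        for _ in range(max_revisions):
            # The model critiques its own candidate step in natural language;
            # the critique (not a trained verifier) guides the revision.
            critique = llm(f"{context}\nCandidate step: {step}\n"
                           "Critique this step. Reply OK if it is sound.")
            if critique.strip().upper().startswith("OK"):
                break
            step = llm(f"{context}\nCandidate step: {step}\n"
                       f"Critique: {critique}\nRewrite the step to fix the flaw.")
        steps.append(step)
    return steps
```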
- Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback.
Experiments show that Step-KTO significantly improves both final-answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z)
- Think Beyond Size: Adaptive Prompting for More Effective Reasoning [0.0]
We introduce Adaptive Prompting, a dynamic and iterative framework designed to enhance reasoning by incorporating real-time adjustments to prompt structures and validation mechanisms.
Results demonstrate that Adaptive Prompting significantly improves performance on diverse reasoning benchmarks, including arithmetic reasoning (GSM8K, MultiArith), logical reasoning, and commonsense tasks.
Our approach enables smaller models to achieve competitive performance with larger counterparts, such as GPT-4, while maintaining computational efficiency.
arXiv Detail & Related papers (2024-10-10T17:14:36Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification (a minimal cross-check in this spirit is sketched after this entry).
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
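The paper's verification pipeline is more elaborate than this; the sketch below only illustrates the core cross-check, accepting an answer when an independently generated CoT solution and an executed PoT program agree (`llm` is a placeholder, and the answer-extraction format is an assumption).

```python
# Minimal sketch of CoT/PoT cross-checking: an answer is trusted only when
# a natural-language (CoT) solution and an executed program (PoT) agree.
# `llm` is a placeholder; extraction and sandboxing are deliberately naive.
import re
from typing import Callable, Optional

def cot_answer(question: str, llm: Callable[[str], str]) -> Optional[str]:
    reply = llm(f"{question}\nThink step by step. End with 'ANSWER: <value>'.")
    match = re.search(r"ANSWER:\s*(\S+)", reply)
    return match.group(1) if match else None

def pot_answer(question: str, llm: Callable[[str], str]) -> Optional[str]:
    code = llm(f"{question}\nWrite Python code that stores the result in `answer`.")
    scope: dict = {}
    try:
        exec(code, scope)  # NOTE: real systems must sandbox model-written code
    except Exception:
        return None
    return str(scope.get("answer"))

def verified_answer(question: str, llm: Callable[[str], str]) -> Optional[str]:
    cot, pot = cot_answer(question, llm), pot_answer(question, llm)
    # Agreement between the two independently derived answers acts as the verifier.
    return cot if cot is not None and cot == pot else None
```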
- Efficient Test-Time Prompt Tuning for Vision-Language Models [41.90997623029582]
Self-TPT is a framework leveraging self-supervised learning for efficient test-time prompt tuning.
We show that Self-TPT not only significantly reduces inference costs but also achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-11T13:55:58Z)
- AlphaMath Almost Zero: Process Supervision without Process [6.318873143509028]
We propose an innovative framework, AlphaMath, that bypasses the need for process annotations by leveraging Monte Carlo Tree Search (MCTS).
This framework focuses on unleashing the potential of a well-pretrained LLM to autonomously enhance its mathematical reasoning.
The experimental results on both in-domain and out-of-domain datasets demonstrate that even without GPT-4 or human-annotated process supervision, our AlphaMath framework achieves comparable or superior results to previous state-of-the-art methods.
arXiv Detail & Related papers (2024-05-06T15:20:30Z)
- Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z)
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without significantly degrading task performance (the batching idea is sketched after this entry).
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
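OverPrompt's exact template is not given in this summary; the sketch below only illustrates the batching idea of packing several inputs into a single classification prompt, so the instructions are paid for once rather than per input (the template and answer format are assumptions).

```python
# Sketch of OverPrompt-style batching: several task inputs share one prompt,
# so the instructions and label set are sent once instead of once per input.
# The template and answer format are assumptions, not the paper's.
def batched_prompt(task: str, labels: list[str], inputs: list[str]) -> str:
    header = (f"{task}\nAllowed labels: {', '.join(labels)}.\n"
              "Classify every item below, answering as '<number>: <label>'.\n")
    items = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(inputs))
    return header + items

# One API call now covers three inputs instead of three separate calls:
print(batched_prompt(
    "Classify the sentiment of each review.",
    ["positive", "negative"],
    ["Great battery life.", "Screen died in a week.", "Does the job."],
))
```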
- NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research [96.53307645791179]
We introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks.
Despite being limited to classification, the resulting stream has a rich diversity of tasks, from OCR to texture analysis, scene recognition, and so forth.
Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks.
arXiv Detail & Related papers (2022-11-15T18:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.