Related papers: Predicting At-Risk Programming Students in Small Imbalanced Datasets using Synthetic Data

URL: http://arxiv.org/abs/2505.17128v1
Date: Wed, 21 May 2025 23:14:25 GMT
Title: Predicting At-Risk Programming Students in Small Imbalanced Datasets using Synthetic Data
Authors: Daniel Flood, Matthew England, Beate Grawemeyer
Abstract summary: This study is part of a larger project focused on measuring, understanding, and improving student engagement in programming education. We investigate whether synthetic data generation can help identify at-risk students earlier in a small, imbalanced dataset from an introductory programming module.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This study is part of a larger project focused on measuring, understanding, and improving student engagement in programming education. We investigate whether synthetic data generation can help identify at-risk students earlier in a small, imbalanced dataset from an introductory programming module. The analysis used anonymised records from 379 students, with 15% marked as failing, and applied several machine learning algorithms. The first experiments showed poor recall for the failing group. However, using synthetic data generation methods led to a significant improvement in performance. Our results suggest that machine learning can help identify at-risk students early in programming courses when combined with synthetic data. This research lays the groundwork for validating and using these models with live student cohorts in the future, to allow for timely and effective interventions that can improve student outcomes. It also includes feature importance analysis to refine formative tasks. Overall, this study contributes to developing practical workflows that help detect disengagement early and improve student success in programming education.
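
Illustrative sketch (not from the paper): the abstract does not name the synthetic data generation methods used, so the following is only a minimal example of one common family of techniques, oversampling the minority class by interpolating between existing minority rows. It is a simplified SMOTE-style scheme that pairs minority rows at random rather than by nearest neighbour. The function names, the two toy features, and the 57/322 class split (roughly 15% of 379) are assumptions made for illustration.

```python
import random


def oversample_minority(rows, labels, minority=1, seed=0):
    """Balance the classes by adding synthetic minority rows.

    Each synthetic row is a random point on the line segment between
    two randomly chosen minority rows (a simplified, SMOTE-style
    interpolation; real SMOTE interpolates towards nearest neighbours).
    """
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_majority = sum(1 for y in labels if y != minority)
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")

    new_rows, new_labels = list(rows), list(labels)
    # Add rows until the minority class matches the majority class in size.
    for _ in range(n_majority - len(minority_rows)):
        a, b = rng.sample(minority_rows, 2)
        t = rng.random()
        new_rows.append([x + t * (y - x) for x, y in zip(a, b)])
        new_labels.append(minority)
    return new_rows, new_labels


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data only: 379 students, 57 labelled as failing (about 15%).
# The two features are hypothetical engagement scores, not the paper's.
data_rng = random.Random(1)
n_fail, n_pass = 57, 322
rows = [[data_rng.gauss(0.3, 0.1), data_rng.gauss(0.4, 0.1)] for _ in range(n_fail)]
rows += [[data_rng.gauss(0.7, 0.1), data_rng.gauss(0.8, 0.1)] for _ in range(n_pass)]
labels = [1] * n_fail + [0] * n_pass

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(len(rows), "rows before,", len(balanced_rows), "rows after oversampling")
```

In practice, oversampling of this kind is normally applied to the training split only, so that recall for the failing group is measured on real, untouched records.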

Related papers

RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library [58.404895570822184]
RV-Syn is a novel mathematical Synthesis approach. It generates graphs as solutions by combining Python-formatted functions from this library. Based on the constructed graph, we achieve solution-guided logic-aware problem generation.
arXiv Detail & Related papers (2025-04-29T04:42:02Z)
Early Detection of At-Risk Students Using Machine Learning [0.0]
We aim to tackle the persistent challenges of higher education retention and student dropout rates by screening for at-risk students. This work considers several machine learning models, including Support Vector Machines (SVM), Naive Bayes, K-nearest neighbors (KNN), Decision Trees, Logistic Regression, and Random Forest. Our analysis indicates that all algorithms generate an acceptable outcome for at-risk student predictions, while Naive Bayes performs best overall.
arXiv Detail & Related papers (2024-12-12T17:33:06Z)
LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education [5.421088637597145]
Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data. This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o. We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data.
arXiv Detail & Related papers (2024-11-01T00:24:59Z)
Detecting Unsuccessful Students in Cybersecurity Exercises in Two Different Learning Environments [0.37729165787434493]
This paper develops automated tools to predict when a student is having difficulty. In a potential application, such models can aid instructors in detecting struggling students and providing targeted help.
arXiv Detail & Related papers (2024-08-16T04:57:54Z)
A Predictive Model using Machine Learning Algorithm in Identifying Students Probability on Passing Semestral Course [0.0]
This study employs classification as the data mining technique and a decision tree as the algorithm. With the utilization of the newly discovered predictive model, the prediction of students' probabilities of passing the courses they are currently taking gives 0.7619 accuracy, 0.8333 precision, 0.8823 recall, and 0.8571 F1 score.
arXiv Detail & Related papers (2023-04-12T01:57:08Z)
Responsible Active Learning via Human-in-the-loop Peer Study [88.01358655203441]
We propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability. We first introduce a human-in-the-loop teacher-student architecture to isolate unlabelled data from the task learner (teacher) on the cloud-side. During training, the task learner instructs the light-weight active learner which then provides feedback on the active sampling criterion.
arXiv Detail & Related papers (2022-11-24T13:18:27Z)
Towards Robust Dataset Learning [90.2590325441068]
We propose a principled, tri-level optimization to formulate the robust dataset learning problem. Under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset.
arXiv Detail & Related papers (2022-11-19T17:06:10Z)
A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI. This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data. Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task. This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
Early Performance Prediction using Interpretable Patterns in Programming Process Data [13.413990352918098]
We leverage rich, fine-grained log data to build a model to predict student course outcomes. We evaluate our approach on a dataset from 106 students in a block-based, introductory programming course.
arXiv Detail & Related papers (2021-02-10T22:46:45Z)
BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration [72.88493072196094]
We present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a set of input-output examples. We show that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches.
arXiv Detail & Related papers (2020-07-28T17:46:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.