Predicting First Year Dropout from Pre Enrolment Motivation Statements Using Text Mining
- URL: http://arxiv.org/abs/2509.16224v1
- Date: Fri, 12 Sep 2025 09:32:02 GMT
- Title: Predicting First Year Dropout from Pre Enrolment Motivation Statements Using Text Mining
- Authors: K. F. B. Soppe, A. Bagheri, S. Nadi, I. G. Klugkist, T. Wubbels, L. D. N. V. Wijngaards-De Meij
- Abstract summary: High School GPA is a strong predictor of dropout, but much variance in dropout remains to be explained. This study focused on predicting university dropout by using text mining techniques.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Preventing student dropout is a major challenge in higher education, and it is difficult to predict prior to enrolment which students are likely to drop out and which are likely to succeed. High School GPA is a strong predictor of dropout, but much variance in dropout remains to be explained. This study focused on predicting university dropout by using text mining techniques, with the aim of extracting information contained in motivation statements written by students. By combining text data with classic predictors of dropout in the form of student characteristics, we attempt to enhance the available set of predictive student characteristics. Our dataset consisted of 7,060 motivation statements of students enrolling in a non-selective bachelor at a Dutch university in 2014 and 2015. Support Vector Machines were trained on 75 percent of the data and several models were estimated on the test data. We used various combinations of student characteristics and text features, such as TF-IDF, topic modelling, and the LIWC dictionary. Results showed that, although the combination of text and student characteristics did not improve the prediction of dropout, text analysis alone predicted dropout about as well as a set of student characteristics. Suggestions for future research are provided.
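As a rough illustration of the TF-IDF features and the 75 percent training split mentioned in the abstract, the following is a minimal sketch using only the Python standard library. The statements, labels, function names, and the particular smoothed IDF formula are all assumptions made for illustration; they are not taken from the paper, and the sketch stops before the classifier (the paper's Support Vector Machines would be trained on vectors like these).

```python
import math
import random
from collections import Counter

# Toy, invented statements (NOT from the paper's dataset); label 1 = dropped out.
statements = [
    ("i have always loved psychology and helping people", 0),
    ("my parents think this programme is a good choice", 1),
    ("i enjoy statistics and want to do research", 0),
    ("i am not sure yet but the city seems nice", 1),
    ("research on behaviour fascinates me", 0),
    ("a friend told me this programme is easy", 1),
    ("i want to become a clinical psychologist", 0),
    ("i could not get into my first choice", 1),
]


def tfidf(docs):
    """Return (sorted vocabulary, list of {term: weight} dicts).

    Term frequency is the relative frequency within a document; the IDF
    is a smoothed variant, log((1 + N) / (1 + df)) + 1 (an assumption).
    """
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return sorted(idf), vectors


def train_test_split(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of the items and cut it into train and test parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Split first, then weight using the training statements only.
train, test = train_test_split(statements)
vocab, train_vectors = tfidf([text for text, _ in train])
```

Computing the weights on the training part only keeps information from the held-out statements out of the features, which is the usual reason for splitting before feature extraction.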
Related papers
- SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning [0.4369550829556578]
School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance.
We introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model.
Our model was tested on unseen data from the next academic year, achieving an accuracy of 84%, compared to 82% for the baseline model.
arXiv Detail & Related papers (2025-07-14T16:04:34Z)
- Misspellings in Natural Language Processing: A survey [52.419589623702336]
Misspellings have become ubiquitous in digital communication.
We reconstruct a history of misspellings as a scientific problem.
We discuss the latest advancements to address the challenge of misspellings in NLP.
arXiv Detail & Related papers (2025-01-28T10:26:04Z)
- Predicting Long-Term Student Outcomes from Short-Term EdTech Log Data [24.198449873743762]
We investigate machine learning predictors using students' logs during their first few hours of usage.
Our findings suggest that short-term log usage data, from 2-5 hours, can be used to provide valuable signal about students' long-term external performance.
arXiv Detail & Related papers (2024-12-20T01:05:23Z)
- Why Do Students Drop Out? University Dropout Prediction and Associated Factor Analysis Using Machine Learning Techniques [0.5042480200195721]
This study examined university dropout prediction using academic, demographic, socioeconomic, and macroeconomic data types.
The data type most influential to the model performance was found to be academic data.
Preliminary results indicate that a correlation does exist between data types and dropout status.
arXiv Detail & Related papers (2023-10-17T04:20:00Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
- A Predictive Model for Student Performance in Classrooms Using Student Interactions With an eTextbook [0.0]
This paper proposes a new model for predicting student performance based on an analysis of how students interact with an interactive online eTextbook.
To build the proposed model, we evaluated the most popular classification and regression algorithms on data from a data structures and algorithms course.
arXiv Detail & Related papers (2022-02-16T11:59:53Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- Predicting MOOCs Dropout Using Only Two Easily Obtainable Features from the First Week's Activities [56.1344233010643]
Several features are considered to contribute towards learner attrition or lack of interest, which may lead to disengagement or total dropout.
This study aims to predict dropout early on, from the first week, by comparing several machine-learning approaches.
arXiv Detail & Related papers (2020-08-12T10:44:49Z)
- ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning [85.33459673197149]
We introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations.
In this paper, we propose to identify biased data points and separate them into an EASY set, with the rest forming a HARD set.
Empirical results show that state-of-the-art models have an outstanding ability to capture biases contained in the dataset, with high accuracy on the EASY set.
However, they struggle on the HARD set, with performance near that of random guessing, indicating that more research is needed to substantially enhance the logical reasoning ability of current models.
arXiv Detail & Related papers (2020-02-11T11:54:29Z)
- Academic Performance Estimation with Attention-based Graph Convolutional Networks [17.985752744098267]
Given a student's past data, the task of student performance prediction is to predict the student's grades in future courses.
Traditional methods for student performance prediction usually neglect the underlying relationships between multiple courses.
We propose a novel attention-based graph convolutional network model for student performance prediction.
arXiv Detail & Related papers (2019-12-26T23:11:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.