Enhancing Credit Default Prediction Using Boruta Feature Selection and DBSCAN Algorithm with Different Resampling Techniques
- URL: http://arxiv.org/abs/2509.19408v1
- Date: Tue, 23 Sep 2025 13:43:18 GMT
- Title: Enhancing Credit Default Prediction Using Boruta Feature Selection and DBSCAN Algorithm with Different Resampling Techniques
- Authors: Obu-Amoah Ampomah, Edmund Agyemang, Kofi Acheampong, Louis Agyekum,
- Abstract summary: This study examines credit default prediction by comparing three techniques, namely SMOTE, SMOTE-Tomek, and ADASYN.<n> recognizing that credit default datasets are typically skewed, we began our analysis by evaluating machine learning (ML) models on the imbalanced data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study examines credit default prediction by comparing three techniques, namely SMOTE, SMOTE-Tomek, and ADASYN, that are commonly used to address the class imbalance problem in credit default situations. Recognizing that credit default datasets are typically skewed, with defaulters comprising a much smaller proportion than non-defaulters, we began our analysis by evaluating machine learning (ML) models on the imbalanced data without any resampling to establish baseline performance. These baseline results provide a reference point for understanding the impact of subsequent balancing methods. In addition to traditional classifiers such as Naive Bayes and K-Nearest Neighbors (KNN), our study also explores the suitability of advanced ensemble boosting algorithms, including Extreme Gradient Boosting (XGBoost), AdaBoost, Gradient Boosting Machines (GBM), and Light GBM for credit default prediction using Boruta feature selection and DBSCAN-based outlier detection, both before and after resampling. A real-world credit default data set sourced from the University of Cleveland ML Repository was used to build ML classifiers, and their performances were tested. The criteria chosen to measure model performance are the area under the receiver operating characteristic curve (ROC-AUC), area under the precision-recall curve (PR-AUC), G-mean, and F1-scores. The results from this empirical study indicate that the Boruta+DBSCAN+SMOTE-Tomek+GBM classifier outperformed the other ML models (F1-score: 82.56%, G-mean: 82.98%, ROC-AUC: 90.90%, PR-AUC: 91.85%) in a credit default context. The findings establish a foundation for future progress in creating more resilient and adaptive credit default systems, which will be essential as credit-based transactions continue to rise worldwide.
Related papers
- Predicting Mortgage Default with Machine Learning: AutoML, Class Imbalance, and Leakage Control [7.5947844798897535]
We compare multiple machine learning approaches for mortgage default prediction using a real-world loan-level dataset.<n>We employ leakage-aware feature selection, a strict temporal split that constrains both origination and reporting periods.<n>Across multiple positive-to-negative ratios, performance remains stable, and an AutoML approach achieves the strongest AUROC among the models evaluated.
arXiv Detail & Related papers (2026-01-27T15:25:04Z) - Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models [86.53246292425699]
We present BCA+, a training-free framework for TTA for both object recognition and detection.<n>We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction.<n>BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
arXiv Detail & Related papers (2025-10-03T06:27:33Z) - Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs.
We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Credit card score prediction using machine learning models: A new
dataset [2.099922236065961]
This study investigates the utilization of machine learning (ML) models for credit card default prediction system.
The main goal here is to investigate the best-performing ML model for new proposed credit card scoring dataset.
arXiv Detail & Related papers (2023-10-04T16:46:26Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - Divide and Contrast: Source-free Domain Adaptation via Adaptive
Contrastive Learning [122.62311703151215]
Divide and Contrast (DaC) aims to connect the good ends of both worlds while bypassing their limitations.
DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals.
We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch.
arXiv Detail & Related papers (2022-11-12T09:21:49Z) - Estimating oil recovery factor using machine learning: Applications of
XGBoost classification [0.0]
In petroleum engineering, it is essential to determine the ultimate recovery factor, RF, particularly before exploitation and exploration.
We, therefore, applied machine learning (ML), using readily available features, to estimate oil RF for ten classes defined in this study.
arXiv Detail & Related papers (2022-10-28T18:21:25Z) - Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z) - Predicting Credit Risk for Unsecured Lending: A Machine Learning
Approach [0.0]
This research paper is to build a contemporary credit scoring model to forecast credit defaults for unsecured lending (credit cards)
Our research indicates that the Light Gradient Boosting Machine (LGBM) model is better equipped to deliver higher learning speeds, better efficiencies and manage larger data volumes.
We expect that deployment of this model will enable better and timely prediction of credit defaults for decision-makers in commercial lending institutions and banks.
arXiv Detail & Related papers (2021-10-05T17:54:56Z) - Machine Learning approach for Credit Scoring [0.0]
We build a stack of machine learning models aimed at composing a state-of-the-art credit rating and default prediction system.
Our approach is an excursion through the most recent ML / AI concepts.
arXiv Detail & Related papers (2020-07-20T21:29:06Z) - Provable tradeoffs in adversarially robust classification [96.48180210364893]
We develop and leverage new tools, including recent breakthroughs from probability theory on robust isoperimetry.
Our results reveal fundamental tradeoffs between standard and robust accuracy that grow when data is imbalanced.
arXiv Detail & Related papers (2020-06-09T09:58:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.