Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data
- URL: http://arxiv.org/abs/2409.09792v1
- Date: Sun, 15 Sep 2024 16:59:15 GMT
- Title: Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data
- Authors: Xu Sun, Zixuan Qin, Shun Zhang, Yuexian Wang, Li Huang
- Abstract summary: This study investigates data pre-processing techniques to enhance existing financial risk datasets.
We introduce TriEnhance, a straightforward technique that entails: (1) generating synthetic samples specifically tailored to the minority class, (2) filtering using binary feedback to refine samples, and (3) self-learning with pseudo-labels.
Our experiments reveal the efficacy of TriEnhance, with a notable focus on improving minority class calibration, a key factor for developing more robust financial risk prediction systems.
- Score: 11.910955398918444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the financial risk domain, particularly in credit default prediction and fraud detection, accurate identification of high-risk class instances is paramount, as their occurrence can have significant economic implications. Although machine learning models have gained widespread adoption for risk prediction, their performance is often hindered by the scarcity and limited diversity of high-quality data. This limitation stems from dataset factors such as small risk sample sizes, high labeling costs, and severe class imbalance, which impede the models' ability to learn effectively and accurately forecast critical events. This study investigates data pre-processing techniques to enhance existing financial risk datasets by introducing TriEnhance, a straightforward technique that entails: (1) generating synthetic samples specifically tailored to the minority class, (2) filtering using binary feedback to refine samples, and (3) self-learning with pseudo-labels. Our experiments across six benchmark datasets reveal the efficacy of TriEnhance, with a notable focus on improving minority class calibration, a key factor for developing more robust financial risk prediction systems.
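The paper itself ships no code, but the three-stage recipe can be illustrated with a minimal sketch. The assumptions here are not the authors' implementation: SMOTE-style interpolation stands in for step (1), a validator classifier trained on held-out labels supplies the binary feedback of step (2), and confidence-thresholded pseudo-labels implement step (3); the helper names `synthesize_minority`, `filter_by_feedback`, and `pseudo_label` are illustrative.

```python
# Minimal sketch of a TriEnhance-style pipeline (illustrative only).
# Assumes binary labels where y == 1 is the rare high-risk class.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synthesize_minority(X, y, n_new):
    """Step 1: SMOTE-style interpolation between random minority pairs."""
    X_min = X[y == 1]
    i = rng.integers(0, len(X_min), n_new)
    j = rng.integers(0, len(X_min), n_new)
    lam = rng.random((n_new, 1))
    return X_min[i] + lam * (X_min[j] - X_min[i])

def filter_by_feedback(validator, X_syn, keep_threshold=0.5):
    """Step 2: keep synthetic samples the validator accepts as minority."""
    return X_syn[validator.predict_proba(X_syn)[:, 1] > keep_threshold]

def pseudo_label(model, X_unlabeled, confidence=0.95):
    """Step 3: self-learning -- adopt only high-confidence predictions."""
    proba = model.predict_proba(X_unlabeled)
    conf, label = proba.max(axis=1), proba.argmax(axis=1)
    mask = conf >= confidence
    return X_unlabeled[mask], label[mask]

# X, y: labeled data; X_pool: unlabeled pool (placeholder data).
X, y = rng.normal(size=(1000, 8)), (rng.random(1000) < 0.05).astype(int)
X_pool = rng.normal(size=(500, 8))

X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
validator = GradientBoostingClassifier().fit(X_val, y_val)

X_syn = filter_by_feedback(validator, synthesize_minority(X_tr, y_tr, 200))
model = GradientBoostingClassifier().fit(
    np.vstack([X_tr, X_syn]),
    np.concatenate([y_tr, np.ones(len(X_syn), int)]))

X_pl, y_pl = pseudo_label(model, X_pool)
final = GradientBoostingClassifier().fit(
    np.vstack([X_tr, X_syn, X_pl]),
    np.concatenate([y_tr, np.ones(len(X_syn), int), y_pl]))
```

The pseudo-label confidence threshold governs a precision/coverage trade-off: a stricter threshold adopts fewer but cleaner labels into the self-learning step.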
Related papers
- Provably Unlearnable Data Examples [27.24152626809928]
Efforts have been undertaken to render shared data unlearnable for unauthorized models in the wild.
We propose a mechanism for certifying the so-called $(q, \eta)$-Learnability of an unlearnable dataset.
A lower certified $(q, \eta)$-Learnability indicates a more robust and effective protection over the dataset.
arXiv Detail & Related papers (2024-05-06T09:48:47Z)
- Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models.
We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies.
We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
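For context, the Uncertainty Sampling baseline at the heart of this benchmark can be sketched in a few lines (a generic sketch with assumed helper names, not the paper's benchmark code): query the pool points with the highest predictive entropy, label them, and retrain.

```python
# Generic uncertainty-sampling loop: repeatedly query the pool points the
# current model is least sure about, measured here by predictive entropy.
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_uncertain(model, X_pool, batch_size=10):
    proba = model.predict_proba(X_pool)
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[-batch_size:]  # highest-entropy indices

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 4))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # stand-in oracle

labeled = list(range(20))  # small seed set
for _ in range(5):
    model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
    picked = unlabeled[query_most_uncertain(model, X_pool[unlabeled])]
    labeled.extend(picked.tolist())
```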
arXiv Detail & Related papers (2024-05-02T16:50:47Z)
- The Risk of Federated Learning to Skew Fine-Tuning Features and Underperform Out-of-Distribution Robustness [50.52507648690234]
Federated learning risks skewing fine-tuning features and compromising model robustness.
We introduce three robustness indicators and conduct experiments across diverse robust datasets.
Our approach markedly enhances the robustness across diverse scenarios, encompassing various parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2024-01-25T09:18:51Z)
- DeRisk: An Effective Deep Learning Framework for Credit Risk Prediction over Real-World Financial Data [13.480823015283574]
We propose DeRisk, an effective deep learning risk prediction framework for credit risk prediction on real-world financial data.
DeRisk is the first deep risk prediction model that outperforms statistical learning approaches deployed in our company's production system.
arXiv Detail & Related papers (2023-08-07T16:22:59Z)
- Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against subgroups described by protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z)
- Fair Active Learning: Solving the Labeling Problem in Insurance [2.5470832667329213]
The paper explores various active learning sampling methodologies and evaluates their impact on both synthetic and real insurance datasets.
The proposed approach samples informative and fair instances, achieving a good balance between model predictive performance and fairness.
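The paper does not publish its sampler; one hypothetical way to combine informativeness with fairness, sketched under the assumption of a known protected-group attribute (`groups`, `labeled_groups`, and `fair_al_score` are illustrative names, not the paper's method):

```python
# Hypothetical scoring rule in the spirit of fair active learning: trade off
# informativeness (entropy) against how under-represented each instance's
# protected group is among the labels collected so far.
import numpy as np

def fair_al_score(proba, groups, labeled_groups, lam=0.5):
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    counts = np.array([np.sum(labeled_groups == g) for g in groups])
    scarcity = 1.0 / (1.0 + counts)  # rarer groups score higher
    return entropy + lam * scarcity  # query the argmax of this score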
arXiv Detail & Related papers (2021-12-17T12:07:04Z)
- Detecting and Mitigating Test-time Failure Risks via Model-agnostic Uncertainty Learning [30.86992077157326]
This paper introduces Risk Advisor, a novel post-hoc meta-learner for estimating failure risks and predictive uncertainties of any already-trained black-box classification model.
In addition to providing a risk score, the Risk Advisor decomposes the uncertainty estimates into aleatoric and epistemic uncertainty components.
Experiments on various families of black-box classification models and on real-world and synthetic datasets show that the Risk Advisor reliably predicts deployment-time failure risks.
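Risk Advisor's meta-learner is not reproduced here, but the standard ensemble-based decomposition it echoes, where total predictive entropy splits into an expected-entropy (aleatoric) term and a mutual-information (epistemic) remainder, can be sketched as:

```python
# Common ensemble-based split of predictive uncertainty into aleatoric and
# epistemic parts (generic sketch; Risk Advisor's meta-learner differs).
import numpy as np

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def decompose_uncertainty(member_probas):
    """member_probas: (n_models, n_samples, n_classes) ensemble outputs."""
    mean_p = member_probas.mean(axis=0)
    total = entropy(mean_p)                          # H[E p]: total
    aleatoric = entropy(member_probas).mean(axis=0)  # E H[p]: data noise
    epistemic = total - aleatoric                    # mutual information
    return total, aleatoric, epistemic
```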
arXiv Detail & Related papers (2021-09-09T17:23:31Z)
- Learning from Similarity-Confidence Data [94.94650350944377]
We investigate a novel weakly supervised learning problem of learning from similarity-confidence (Sconf) data.
We propose an unbiased estimator of the classification risk that can be calculated from only Sconf data and show that the estimation error bound achieves the optimal convergence rate.
arXiv Detail & Related papers (2021-02-13T07:31:16Z)
- Provable tradeoffs in adversarially robust classification [96.48180210364893]
We develop and leverage new tools, including recent breakthroughs from probability theory on robust isoperimetry.
Our results reveal fundamental tradeoffs between standard and robust accuracy that grow when data is imbalanced.
arXiv Detail & Related papers (2020-06-09T09:58:19Z)
- Precise Tradeoffs in Adversarial Training for Linear Regression [55.764306209771405]
We provide a precise and comprehensive understanding of the role of adversarial training in the context of linear regression with Gaussian features.
We precisely characterize the standard/robust accuracy and the corresponding tradeoff achieved by a contemporary mini-max adversarial training approach.
Our theory for adversarial training algorithms also facilitates the rigorous study of how a variety of factors (size and quality of training data, model overparametrization etc.) affect the tradeoff between these two competing accuracies.
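The objective analyzed in this setting is the usual adversarial-training min-max problem; for linear regression with perturbation budget $\epsilon$ it reads

$$\min_{\theta}\; \mathbb{E}_{(x,y)}\Big[\max_{\|\delta\|_2 \le \epsilon}\big(y - \theta^\top (x+\delta)\big)^2\Big],$$

recovering the standard least-squares risk at $\epsilon = 0$, with the standard/robust tradeoff traced out as $\epsilon$ grows.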
arXiv Detail & Related papers (2020-02-24T19:01:47Z)
- On the Role of Dataset Quality and Heterogeneity in Model Confidence [27.657631193015252]
Safety-critical applications require machine learning models that output accurate and calibrated probabilities.
Uncalibrated deep networks are known to make over-confident predictions.
We study the impact of dataset quality by examining how dataset size and label noise affect model confidence.
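A standard way to quantify the (mis)calibration this paper studies is the Expected Calibration Error; a minimal sketch (generic, not the paper's exact protocol):

```python
# Expected Calibration Error (ECE): bin predictions by confidence and
# compare each bin's average confidence to its empirical accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin population
    return ece
```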
arXiv Detail & Related papers (2020-02-23T05:13:12Z)