Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring
- URL: http://arxiv.org/abs/2501.10677v2
- Date: Sat, 01 Feb 2025 03:55:35 GMT
- Title: Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring
- Authors: Xia Li, Hanghang Zheng, Xiao Chen, Hong Liu, Mao Mao,
- Abstract summary: We present a novel framework for scaling up the application of large pretrained models on financial datasets.
We integrate the imbalance-aware techniques during dataset distillation, resulting in improved performance in financial datasets.
- Score: 10.737033782376905
- License:
- Abstract: The advent of artificial intelligence has significantly enhanced credit scoring technologies. Despite the remarkable efficacy of advanced deep learning models, mainstream adoption continues to favor tree-structured models due to their robust predictive performance on tabular data. Although pretrained models have seen considerable development, their application within the financial realm predominantly revolves around question-answering tasks and the use of such models for tabular-structured credit scoring datasets remains largely unexplored. Tabular-oriented large models, such as TabPFN, has made the application of large models in credit scoring feasible, albeit can only processing with limited sample sizes. This paper provides a novel framework to combine tabular-tailored dataset distillation technique with the pretrained model, empowers the scalability for TabPFN. Furthermore, though class imbalance distribution is the common nature in financial datasets, its influence during dataset distillation has not been explored. We thus integrate the imbalance-aware techniques during dataset distillation, resulting in improved performance in financial datasets (e.g., a 2.5% enhancement in AUC). This study presents a novel framework for scaling up the application of large pretrained models on financial tabular datasets and offers a comparative analysis of the influence of class imbalance on the dataset distillation process. We believe this approach can broaden the applications and downstream tasks of large models in the financial domain.
Related papers
- Beyond Tree Models: A Hybrid Model of KAN and gMLP for Large-Scale Financial Tabular Data [28.34587057844627]
TKGMLP is a hybrid network for tabular data that combines shallow Kolmogorov Arnold Networks with Gated Multilayer Perceptron.
We validate TKGMLP on a real-world credit scoring dataset, where it achieves state-of-the-art results and outperforms current benchmarks.
We propose a novel feature encoding method for numerical data, specifically designed to address the predominance of numerical features in financial datasets.
arXiv Detail & Related papers (2024-12-03T02:38:07Z) - A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z) - Escaping the Forest: Sparse Interpretable Neural Networks for Tabular Data [0.0]
We show that our models, Sparse TABular NET or sTAB-Net with attention mechanisms, are more effective than tree-based models.
They achieve better performance than post-hoc methods like SHAP.
arXiv Detail & Related papers (2024-10-23T10:50:07Z) - Gradient Reduction Convolutional Neural Network Policy for Financial Deep Reinforcement Learning [0.0]
This paper introduces two significant enhancements to refine our CNN model's predictive performance and robustness for financial data.
Firstly, we integrate a normalization layer at the input stage to ensure consistent feature scaling.
Secondly, we employ a Gradient Reduction Architecture, where earlier layers are wider and subsequent layers are progressively narrower.
arXiv Detail & Related papers (2024-08-16T11:39:03Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce textitCLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z) - Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML
Evaluation [3.737892247639591]
We present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets.
BAF is a set of challenges commonplace in real-world applications, including temporal dynamics and significant class imbalance.
We aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
arXiv Detail & Related papers (2022-11-24T00:03:29Z) - Graph-Regularized Tensor Regression: A Domain-Aware Framework for
Interpretable Multi-Way Financial Modelling [23.030263841031633]
We develop a novel Graph-Regularized Regression (GRTR) framework, whereby knowledge about cross-asset relations is incorporated into the model in the form of a graph Laplacian matrix.
By virtue of tensor algebra, the proposed framework is shown to be fully interpretable, both coefficient-wise and dimension-wise.
The GRTR model is validated in a multi-way financial forecasting setting and is shown to achieve improved performance at reduced computational costs.
arXiv Detail & Related papers (2022-10-26T13:39:08Z) - Back2Future: Leveraging Backfill Dynamics for Improving Real-time
Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task.
'Backfill' phenomenon and its effect on model performance has been barely studied in the prior literature.
We formulate a novel problem and neural framework Back2Future that aims to refine a given model's predictions in real-time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - CHEER: Rich Model Helps Poor Model via Knowledge Infusion [69.23072792708263]
We develop a knowledge infusion framework named CHEER that can succinctly summarize such rich model into transferable representations.
Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
arXiv Detail & Related papers (2020-05-21T21:44:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.