Missing Data Imputation With Granular Semantics and AI-driven Pipeline for Bankruptcy Prediction
- URL: http://arxiv.org/abs/2404.00013v1
- Date: Fri, 15 Mar 2024 13:01:09 GMT
- Title: Missing Data Imputation With Granular Semantics and AI-driven Pipeline for Bankruptcy Prediction
- Authors: Debarati Chakraborty, Ravi Ranjan
- Abstract summary: This work focuses on designing a pipeline for the prediction of bankruptcy.
The presence of missing values, high-dimensional data, and highly class-imbalanced databases are the major challenges in this task.
A new method for missing data imputation with granular semantics has been introduced here.
- Score: 0.34530027457862006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work focuses on designing a pipeline for the prediction of bankruptcy. The presence of missing values, high-dimensional data, and highly class-imbalanced databases are the major challenges in this task. A new method for missing data imputation with granular semantics is introduced, drawing on the merits of granular computing. The missing values are predicted in the granular space, a low-dimensional space, using the feature semantics and reliable observations. A granule is formed around every missing entry from a few highly correlated features and the most reliable closest observations, which preserves the relevance and reliability (the context) of the database with respect to the missing entries. An intergranular prediction is then carried out for the imputation within those contextual granules. The contextual granules thus allow a small, relevant fraction of the huge database to be used for imputation, avoiding the need to access the entire database repeatedly for each missing value. The method is implemented and tested for the prediction of bankruptcy on the Polish Bankruptcy dataset, and it provides an efficient solution for big and high-dimensional datasets even with large imputation rates. An AI-driven pipeline for bankruptcy prediction is then designed using the proposed granular-semantics-based data filling method, followed by solutions to the issues of high dimensionality and high class imbalance in the dataset. The rest of the pipeline consists of feature selection with a random forest to reduce dimensionality, data balancing with SMOTE, and prediction with six popular classifiers, including a deep NN. All methods defined here have been experimentally verified with suitable comparative studies and shown to be effective on all the datasets captured over the five years.
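The granule idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how the intergranular prediction is computed, so the sketch substitutes a plain mean over the nearest observations. The function names `pearson` and `impute_granular` and the parameters `n_features` and `n_neighbors` are hypothetical, chosen only for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def impute_granular(rows, n_features=2, n_neighbors=3):
    """Fill None entries using a small granule built around each one.

    Assumption: the imputed value is the mean of the target column over the
    nearest observations. The paper's actual predictor may differ.
    """
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]  # copy, so the input stays untouched
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Step 1: rank the features observed in this row by their
            # absolute correlation with the column that has the gap.
            scored = []
            for k in range(n_cols):
                if k == j or row[k] is None:
                    continue
                pairs = [(r[j], r[k]) for r in rows
                         if r[j] is not None and r[k] is not None]
                if pairs:
                    a, b = zip(*pairs)
                    scored.append((abs(pearson(a, b)), k))
            scored.sort(reverse=True)
            context = [k for _, k in scored[:n_features]]
            # Step 2: keep rows that are fully observed on the target
            # column and on the selected context features.
            candidates = [r for idx, r in enumerate(rows)
                          if idx != i and r[j] is not None
                          and all(r[k] is not None for k in context)]
            if not candidates:
                continue  # nothing reliable to use; leave the gap
            # Step 3: the granule is the set of closest rows, measured by
            # squared Euclidean distance on the context features only.
            candidates.sort(
                key=lambda r: sum((r[k] - row[k]) ** 2 for k in context))
            granule = candidates[:n_neighbors]
            # Step 4: predict the missing value from the granule alone.
            filled[i][j] = sum(r[j] for r in granule) / len(granule)
    return filled


# Toy data: column 1 is twice column 0, with one gap in the middle row.
rows = [
    [1.0, 2.0, 10.0],
    [2.0, 4.0, 9.0],
    [3.0, None, 8.0],
    [4.0, 8.0, 7.0],
    [5.0, 10.0, 6.0],
]
filled = impute_granular(rows, n_features=1, n_neighbors=2)
print(filled[2][1])  # 6.0
```

Each gap is filled from only a few columns and a few rows, which is the point the abstract makes about avoiding repeated passes over the entire database. In a full pipeline, the filled table would then go on to feature selection, class balancing, and classification.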
Related papers
- Efficient Imputation for Patch-based Missing Single-cell Data via Cluster-regularized Optimal Transport [11.748577799315191]
We present CROT, an optimal transport-based imputation algorithm designed to handle patch-based missing data.
Our approach effectively captures the underlying data structure in the presence of significant missingness.
This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence.
arXiv Detail & Related papers (2026-01-21T04:58:13Z) - Kernel Representation and Similarity Measure for Incomplete Data [55.62595187178638]
Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis.
Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates.
This paper presents a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space.
arXiv Detail & Related papers (2025-10-15T09:41:23Z) - RFOD: Random Forest-based Outlier Detection for Tabular Data [12.469208664014472]
Outlier detection is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare.
RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem.
RFOD consistently outperforms state-of-the-art baselines in detection accuracy.
arXiv Detail & Related papers (2025-10-09T19:02:12Z) - Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series.
Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data.
This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy.
We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z) - Data Retrieval with Importance Weights for Few-Shot Imitation Learning [31.8638426686593]
We introduce Importance Weighted Retrieval (IWR), which estimates importance weights, or the ratio between the target and prior data distributions for retrieval.
IWR consistently improves performance of existing retrieval-based methods, despite only requiring minor modifications.
arXiv Detail & Related papers (2025-09-01T17:58:41Z) - PEEL the Layers and Find Yourself: Revisiting Inference-time Data Leakage for Residual Neural Networks [64.90981115460937]
This paper explores inference-time data leakage risks of deep neural networks (NNs).
We propose a novel backward feature inversion method, PEEL, which can effectively recover block-wise input features from the intermediate output of residual NNs.
Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE).
arXiv Detail & Related papers (2025-04-08T20:11:05Z) - DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series [0.01874930567916036]
Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets.
We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools.
Our dataset represents a discrete-sequence problem, which remains unaddressed by previously-proposed solutions in literature.
arXiv Detail & Related papers (2024-11-21T09:03:12Z) - Iterative Forgetting: Online Data Stream Regression Using Database-Inspired Adaptive Granulation [1.6874375111244329]
We present a database-inspired datastream regression model that uses inspiration from R*-trees to create granules from incoming datastreams.
Experiments demonstrate that the ability of this method to discard data produces a significant order-of-magnitude improvement in latency and training time.
arXiv Detail & Related papers (2024-03-14T17:26:00Z) - BlockEcho: Retaining Long-Range Dependencies for Imputing Block-Wise Missing Data [2.507127323074818]
Block-wise missing data poses significant challenges in real-world data imputation tasks.
Most SOTA matrix completion methods appeared less effective, primarily due to overreliance on neighboring elements for predictions.
We propose a novel matrix completion method, "BlockEcho", for a more comprehensive solution.
arXiv Detail & Related papers (2024-02-29T02:13:10Z) - Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - Robust self-healing prediction model for high dimensional data [0.685316573653194]
This work proposes a robust self healing (RSH) hybrid prediction model.
It functions by using the data in its entirety by removing errors and inconsistencies from it rather than discarding any data.
The proposed method is compared with some of the existing high performing models and the results are analyzed.
arXiv Detail & Related papers (2022-10-04T17:55:50Z) - Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z) - Minimax rate of consistency for linear models with missing values [0.0]
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...).
In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task.
This eventually requires solving a number of learning tasks that is exponential in the number of input features, which makes predictions impossible for current real-world datasets.
arXiv Detail & Related papers (2022-02-03T08:45:34Z) - Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Learning Output Embeddings in Structured Prediction [73.99064151691597]
A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension.
A prediction in the original space is computed by solving a pre-image problem.
In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space.
arXiv Detail & Related papers (2020-07-29T09:32:53Z) - Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems [0.0]
We develop a consistent framework for both training and imputation.
We benchmarked the results against state-of-the-art imputation methods.
The developed autoencoder obtained the smallest error for all ranges of initial data corruption.
arXiv Detail & Related papers (2020-04-06T12:00:30Z) - Uncertainty Estimation Using a Single Deep Deterministic Neural Network [66.26231423824089]
We propose a method for training a deterministic deep model that can find and reject out of distribution data points at test time with a single forward pass.
We scale training in these with a novel loss function and centroid updating scheme and match the accuracy of softmax models.
arXiv Detail & Related papers (2020-03-04T12:27:36Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.