Secure and Explainable Fraud Detection in Finance via Hierarchical Multi-source Dataset Distillation
- URL: http://arxiv.org/abs/2512.21866v1
- Date: Fri, 26 Dec 2025 05:00:35 GMT
- Title: Secure and Explainable Fraud Detection in Finance via Hierarchical Multi-source Dataset Distillation
- Authors: Yiming Qian, Thorsten Neumann, Xueyining Huang, David Hardoon, Fei Gao, Yong Liu, Siow Mong Rick Goh
- Abstract summary: A trained random forest is converted into transparent, axis-aligned rule regions. Synthetic transactions are generated by uniformly sampling within each region. This produces a compact, auditable surrogate dataset.
- Score: 17.90471000973834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an explainable, privacy-preserving dataset distillation framework for collaborative financial fraud detection. A trained random forest is converted into transparent, axis-aligned rule regions (leaf hyperrectangles), and synthetic transactions are generated by uniformly sampling within each region. This produces a compact, auditable surrogate dataset that preserves local feature interactions without exposing sensitive original records. The rule regions also support explainability: aggregated rule statistics (for example, support and lift) describe global patterns, while assigning each case to its generating region gives concise human-readable rationales and calibrated uncertainty based on tree-vote disagreement. On the IEEE-CIS fraud dataset (590k transactions across three institution-like clusters), distilled datasets reduce data volume by 85% to 93% (often under 15% of the original) while maintaining competitive precision and micro-F1, with only a modest AUC drop. Sharing and augmenting with synthesized data across institutions improves cross-cluster precision, recall, and AUC. Real vs. synthesized structure remains highly similar (over 93% by nearest-neighbor cosine analysis). Membership-inference attacks perform at chance level (about 0.50) when distinguishing training from hold-out records, suggesting low memorization risk. Removing high-uncertainty synthetic points using disagreement scores further boosts AUC (up to 0.687) and improves calibration. Sensitivity tests show weak dependence on the distillation ratio (AUC about 0.641 to 0.645 from 6% to 60%). Overall, tree-region distillation enables trustworthy, deployable fraud analytics with interpretable global rules, per-case rationales with quantified uncertainty, and strong privacy properties suitable for multi-institution settings and regulatory audit.
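The core mechanism in the abstract, converting a trained random forest into leaf hyperrectangles, sampling synthetic points uniformly inside each region, and scoring uncertainty by tree-vote disagreement, can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation; the function names (`leaf_regions`, `distill`, `disagreement`) and the choice of clipping open-ended regions to the observed data range are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def leaf_regions(tree, lo, hi):
    """Enumerate (lower, upper, label) hyperrectangles of one decision tree."""
    t = tree.tree_
    regions = []
    stack = [(0, lo.copy(), hi.copy())]
    while stack:
        node, l, h = stack.pop()
        if t.children_left[node] == -1:           # leaf node
            regions.append((l, h, int(np.argmax(t.value[node]))))
            continue
        f, thr = t.feature[node], t.threshold[node]
        l_left, h_left = l.copy(), h.copy()
        h_left[f] = min(h_left[f], thr)           # x[f] <= thr branch
        l_right, h_right = l.copy(), h.copy()
        l_right[f] = max(l_right[f], thr)         # x[f] > thr branch
        stack.append((t.children_left[node], l_left, h_left))
        stack.append((t.children_right[node], l_right, h_right))
    return regions

def distill(forest, X, n_per_region=5, rng=None):
    """Sample synthetic points uniformly inside each leaf hyperrectangle."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Clip open-ended leaf regions to the observed feature range (an assumption).
    lo, hi = X.min(axis=0), X.max(axis=0)
    Xs, ys = [], []
    for est in forest.estimators_:
        for l, h, label in leaf_regions(est, lo, hi):
            Xs.append(rng.uniform(l, np.maximum(h, l), size=(n_per_region, X.shape[1])))
            ys.extend([label] * n_per_region)
    return np.vstack(Xs), np.array(ys)

def disagreement(forest, X):
    """Tree-vote disagreement: fraction of trees dissenting from the majority."""
    votes = np.stack([est.predict(X) for est in forest.estimators_])
    majority = np.round(votes.mean(axis=0))
    return (votes != majority).mean(axis=0)

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0).fit(X, y)
X_syn, y_syn = distill(rf, X, n_per_region=2)
unc = disagreement(rf, X_syn)
X_kept, y_kept = X_syn[unc < 0.3], y_syn[unc < 0.3]   # drop high-uncertainty points
```

Because every synthetic point is drawn from a leaf region rather than copied from a record, each case can be traced back to the human-readable rule (conjunction of feature bounds) that generated it, which is what supports the per-case rationales described above.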
Related papers
- Semantic-Constrained Federated Aggregation: Convergence Theory and Privacy-Utility Bounds for Knowledge-Enhanced Distributed Learning [0.0]
We introduce Semantic-Constrained Federated Aggregation (SCFA), a theoretically-grounded framework incorporating domain knowledge constraints into distributed optimization. We prove SCFA converges at rate O(1/sqrt(T) + rho), where rho represents the constraint violation rate, establishing the first convergence theory for constraint-based federated learning. We validate our framework on manufacturing predictive maintenance using Bosch production data with 1.18 million samples and 968 sensor features.
arXiv Detail & Related papers (2025-12-12T04:29:29Z) - Natural Geometry of Robust Data Attribution: From Convex Models to Deep Networks [9.553350856191743]
We present a unified framework for robust attribution that extends from convex models to deep networks. For convex settings, we derive Wasserstein-Robust Influence Functions (W-RIF) with provable coverage guarantees. For deep networks, we demonstrate that Euclidean certification is rendered vacuous by spectral amplification.
arXiv Detail & Related papers (2025-12-09T20:40:27Z) - Conformal Lesion Segmentation for 3D Medical Images [82.92159832699583]
We propose a risk-constrained framework that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
arXiv Detail & Related papers (2025-10-19T08:21:00Z) - Privacy Preserved Federated Learning with Attention-Based Aggregation for Biometric Recognition [0.0]
The A3-FL framework is evaluated using FVC2004 fingerprint data. The accuracy, convergence speed, and robustness of the A3-FL framework are superior to those of standard FL (FedAvg) and static baselines.
arXiv Detail & Related papers (2025-10-01T16:58:59Z) - Perfectly-Private Analog Secure Aggregation in Federated Learning [51.61616734974475]
In federated learning, multiple parties train models locally and share their parameters with a central server, which aggregates them to update a global model. In this paper, a novel secure parameter aggregation method is proposed that employs the torus rather than a finite field.
arXiv Detail & Related papers (2025-09-10T15:22:40Z) - GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition [0.0]
We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection and human review. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60%. Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants.
arXiv Detail & Related papers (2025-08-15T09:05:57Z) - Localization Meets Uncertainty: Uncertainty-Aware Multi-Modal Localization [5.414146574747448]
This study introduces a percentile-based rejection strategy that filters out unreliable 3-DoF pose predictions. Experimental results show that applying stricter uncertainty thresholds consistently improves pose accuracy.
arXiv Detail & Related papers (2025-04-10T12:07:24Z) - Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems [89.35169042718739]
Collaborative inference enables end users to leverage powerful deep learning models without exposing sensitive raw data to cloud servers. Recent studies have revealed that the intermediate features shared in this setting may not sufficiently preserve privacy, as information can be leaked and raw data reconstructed via model inversion attacks (MIAs). This work first theoretically proves that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA. We then derive a differentiable and solvable measure bounding this conditional entropy based on Gaussian mixture estimation, and propose a conditional entropy algorithm to enhance inversion robustness.
arXiv Detail & Related papers (2025-03-01T07:15:21Z) - Logit Calibration and Feature Contrast for Robust Federated Learning on Non-IID Data [45.11652096723593]
Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks.
This paper proposes FatCC, which incorporates local logit Calibration and global feature Contrast into the vanilla federated adversarial training process from both logit and feature perspectives.
arXiv Detail & Related papers (2024-04-10T06:35:25Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning [122.62311703151215]
Divide and Contrast (DaC) aims to connect the good ends of both worlds while bypassing their limitations.
DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals.
We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch.
arXiv Detail & Related papers (2022-11-12T09:21:49Z) - Supervised Discriminative Sparse PCA with Adaptive Neighbors for Dimensionality Reduction [47.1456603605763]
We propose a novel linear dimensionality reduction approach, supervised discriminative sparse PCA with adaptive neighbors (SDSPCAAN).
As a result, both global and local data structures, as well as the label information, are used for better dimensionality reduction.
arXiv Detail & Related papers (2020-01-09T17:02:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.