Unsupervised Outlier Detection in Audit Analytics: A Case Study Using USA Spending Data
- URL: http://arxiv.org/abs/2509.19366v1
- Date: Fri, 19 Sep 2025 01:27:18 GMT
- Title: Unsupervised Outlier Detection in Audit Analytics: A Case Study Using USA Spending Data
- Authors: Buhe Li, Berkay Kaplan, Maksym Lazirko, Aleksandr Kogan
- Abstract summary: We employ and compare multiple outlier detection algorithms to identify anomalies in federal spending patterns. Results indicate that a hybrid approach, combining multiple detection strategies, enhances the robustness and accuracy of outlier identification.
- Score: 39.44036223885694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates the effectiveness of unsupervised outlier detection methods in audit analytics, utilizing USA spending data from the U.S. Department of Health and Human Services (DHHS) as a case example. We employ and compare multiple outlier detection algorithms, including Histogram-based Outlier Score (HBOS), Robust Principal Component Analysis (PCA), Minimum Covariance Determinant (MCD), and K-Nearest Neighbors (KNN) to identify anomalies in federal spending patterns. The research addresses the growing need for efficient and accurate anomaly detection in large-scale governmental datasets, where traditional auditing methods may fall short. Our methodology involves data preparation, algorithm implementation, and performance evaluation using precision, recall, and F1 scores. Results indicate that a hybrid approach, combining multiple detection strategies, enhances the robustness and accuracy of outlier identification in complex financial data. This study contributes to the field of audit analytics by providing insights into the comparative effectiveness of various outlier detection models and demonstrating the potential of unsupervised learning techniques in improving audit quality and efficiency. The findings have implications for auditors, policymakers, and researchers seeking to leverage advanced analytics in governmental financial oversight and risk management.
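The abstract names the detectors and the evaluation metrics but this page does not show how they are combined. The following is a minimal pure-Python sketch of one possible hybrid, not the authors' implementation: it computes a simplified HBOS score and a k-th-nearest-neighbor distance score, averages their ranks, flags the top fraction, and reports precision, recall, and F1. The toy rows, labels, the `contamination` parameter, and the rank-averaging rule are all assumptions made for illustration.

```python
import math

def hbos_scores(rows, n_bins=5):
    """Simplified HBOS: sum over features of log(n / count of the value's bin)."""
    n = len(rows)
    scores = [0.0] * n
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant column
        bins = [min(int((v - lo) / width), n_bins - 1) for v in col]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        for i, b in enumerate(bins):
            scores[i] += math.log(n / counts[b])  # rarer bin gives a higher score
    return scores

def knn_scores(rows, k=2):
    """Distance to the k-th nearest neighbor; larger means more isolated."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    return scores

def ranks(scores):
    """Rank positions scaled to [0, 1]; ties are broken by row order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0.0] * len(scores)
    for pos, i in enumerate(order):
        out[i] = pos / (len(scores) - 1)
    return out

def hybrid_flags(rows, contamination=0.2):
    """Assumed combination rule: average the two rank vectors, flag the top share."""
    combined = [(h + k) / 2
                for h, k in zip(ranks(hbos_scores(rows)), ranks(knn_scores(rows)))]
    n_flag = max(1, round(contamination * len(rows)))
    top = sorted(range(len(rows)), key=lambda i: combined[i], reverse=True)[:n_flag]
    flags = [0] * len(rows)
    for i in top:
        flags[i] = 1
    return flags

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = outlier)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy records (amount, count); the last two are planted outliers.
rows = [(10.0, 1.0), (11.0, 1.2), (9.5, 0.9), (10.5, 1.1), (10.2, 1.0),
        (9.8, 1.3), (10.8, 0.8), (10.1, 1.1), (50.0, 1.0), (10.0, 9.0)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

flags = hybrid_flags(rows)
print(flags, precision_recall_f1(labels, flags))
```

Rank averaging is used here because the two detectors produce scores on different scales; other combination rules (maximum, voting) would be equally plausible. In practice a library such as PyOD would likely replace the hand-written detectors.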
Related papers
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards. We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z) - A Comparative Analysis of Statistical and Machine Learning Models for Outlier Detection in Bitcoin Limit Order Books [0.0]
This study conducts a comparative analysis of robust statistical methods and advanced machine learning techniques for real-time anomaly identification in cryptocurrency limit order books (LOBs). We evaluate the efficacy of thirteen diverse models to identify which approaches are most suitable for detecting potentially manipulative trading behaviours. An empirical evaluation, conducted via backtesting on a dataset of 26,204 records from a major exchange, demonstrates that the top-performing model, Empirical Covariance (EC), achieves a 6.70% gain, significantly outperforming a standard Buy-and-Hold benchmark.
arXiv Detail & Related papers (2025-07-20T13:42:36Z) - DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective [70.77570343385928]
We introduce a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). We formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. Our benchmark, DATABench, comprises 17 evasion attacks, 5 forgery attacks, and 9
arXiv Detail & Related papers (2025-07-08T03:07:15Z) - The role of data partitioning on the performance of EEG-based deep learning models in supervised cross-subject analysis: a preliminary study [37.69303106863453]
Deep learning is advancing the analysis of electroencephalography (EEG) data by effectively discovering highly nonlinear patterns. No comprehensive guidelines for proper data partitioning and cross-validation exist in the domain. This paper thoroughly investigates the role of data partitioning and cross-validation in evaluating EEG deep learning models.
arXiv Detail & Related papers (2025-05-19T12:05:28Z) - Access Denied: Meaningful Data Access for Quantitative Algorithm Audits [4.182284365432724]
Third-party audits are often hindered by access restrictions, forcing auditors to rely on limited, low-quality data. We conduct audit simulations on two realistic case studies for recidivism and healthcare coverage prediction. We find that data minimization and anonymization practices can strongly increase error rates on individual-level data, leading to unreliable assessments.
arXiv Detail & Related papers (2025-02-01T13:33:45Z) - A Computational Method for Measuring "Open Codes" in Qualitative Analysis [44.39424825305388]
This paper presents a theory-informed computational method for measuring inductive coding results from humans and Generative AI (GAI). It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
arXiv Detail & Related papers (2024-11-19T00:44:56Z) - A Review of Global Sensitivity Analysis Methods and a comparative case study on Digit Classification [5.458813674116228]
Global sensitivity analysis (GSA) aims to detect influential input factors that lead a model to arrive at a certain decision.
We provide a comprehensive review and comparison of global sensitivity analysis methods.
arXiv Detail & Related papers (2024-06-23T00:38:19Z) - Applied Machine Learning to Anomaly Detection in Enterprise Purchase Processes [0.0]
This work proposes a methodology to prioritise the investigation of the cases detected in two large purchase datasets from real data.
The goal is to contribute to the effectiveness of the companies' control efforts and to increase the performance of carrying out such tasks.
arXiv Detail & Related papers (2024-05-23T16:21:51Z) - Weakly Supervised Anomaly Detection: A Survey [75.26180038443462]
Anomaly detection (AD) is a crucial task in machine learning with various applications.
We present the first comprehensive survey of weakly supervised anomaly detection (WSAD) methods.
For each setting, we provide formal definitions, key algorithms, and potential future directions.
arXiv Detail & Related papers (2023-02-09T10:27:21Z) - Understanding metric-related pitfalls in image analysis validation [59.15220116166561]
This work provides the first comprehensive common point of access to information on pitfalls related to validation metrics in image analysis.
Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy.
arXiv Detail & Related papers (2023-02-03T14:57:40Z) - PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows.
Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z)