Related papers: AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping

AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping

URL: http://arxiv.org/abs/2512.02726v1
Date: Tue, 02 Dec 2025 13:00:57 GMT
Title: AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping
Authors: Md Abdul Kadir, Sai Suresh Macharla Vasu, Sidharth S. Nair, Daniel Sonntag,
Abstract summary: Large language models (LLMs) can serve as anomaly detectors in double-entry bookkeeping.<n> Benchmarking SoTA LLMs on both synthetic and real-world anonymized ledgers, we compare them against JETs and machine learning baselines.<n>Results highlight the potential of textbfAI-augmented auditing, where human auditors collaborate with foundation models to strengthen financial integrity.
Score: 3.589046578266679
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Auditors rely on Journal Entry Tests (JETs) to detect anomalies in tax-related ledger records, but rule-based methods generate overwhelming false positives and struggle with subtle irregularities. We investigate whether large language models (LLMs) can serve as anomaly detectors in double-entry bookkeeping. Benchmarking SoTA LLMs such as LLaMA and Gemma on both synthetic and real-world anonymized ledgers, we compare them against JETs and machine learning baselines. Our results show that LLMs consistently outperform traditional rule-based JETs and classical ML baselines, while also providing natural-language explanations that enhance interpretability. These results highlight the potential of \textbf{AI-augmented auditing}, where human auditors collaborate with foundation models to strengthen financial integrity.

Related papers

Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection [17.04809129025246]
FinFRE-RAG is a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language.<n>LLMs can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts.
arXiv Detail & Related papers (2025-12-15T07:09:11Z)
Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML? [4.0057196015831495]
Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting.<n>This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task.<n>We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations.
arXiv Detail & Related papers (2025-10-29T17:05:00Z)
LLMs as verification oracles for Solidity [1.3887048755037537]
This paper provides the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role.<n>We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios.<n>Our study suggests a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.
arXiv Detail & Related papers (2025-09-23T15:32:13Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
MoRE-LLM: Mixture of Rule Experts Guided by a Large Language Model [54.14155564592936]
We propose a Mixture of Rule Experts guided by a Large Language Model (MoRE-LLM)<n>MoRE-LLM steers the discovery of local rule-based surrogates during training and their utilization for the classification task.<n>LLM is responsible for enhancing the domain knowledge alignment of the rules by correcting and contextualizing them.
arXiv Detail & Related papers (2025-03-26T11:09:21Z)
Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.<n>In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment [53.17596274334017]
We evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs.<n>Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.
arXiv Detail & Related papers (2024-10-06T08:33:39Z)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.