Related papers: Audit Trails for Accountability in Large Language Models

Audit Trails for Accountability in Large Language Models

URL: http://arxiv.org/abs/2601.20727v1
Date: Wed, 28 Jan 2026 16:04:33 GMT
Title: Audit Trails for Accountability in Large Language Models
Authors: Victor Ojewale, Harini Suresh, Suresh Venkatasubramanian,
Abstract summary: Large language models (LLMs) are increasingly embedded in consequential decisions across healthcare, finance, employment, and public services.<n>We propose audit trails as a sociotechnical mechanism for continuous accountability.<n>An audit trail is a chronological, tamper-evident, context-rich ledger of lifecycle events and decisions.
Score: 3.750249890675081
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly embedded in consequential decisions across healthcare, finance, employment, and public services. Yet accountability remains fragile because process transparency is rarely recorded in a durable and reviewable form. We propose LLM audit trails as a sociotechnical mechanism for continuous accountability. An audit trail is a chronological, tamper-evident, context-rich ledger of lifecycle events and decisions that links technical provenance (models, data, training and evaluation runs, deployments, monitoring) with governance records (approvals, waivers, and attestations), so organizations can reconstruct what changed, when, and who authorized it. This paper contributes: (1) a lifecycle framework that specifies event types, required metadata, and governance rationales; (2) a reference architecture with lightweight emitters, append only audit stores, and an auditor interface supporting cross organizational traceability; and (3) a reusable, open-source Python implementation that instantiates this audit layer in LLM workflows with minimal integration effort. We conclude by discussing limitations and directions for adoption.

Related papers

IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation [49.796717294455796]
We present IMMACULATE, a practical auditing framework that detects economically motivated deviations.<n>IMMACULATE selectively audits a small fraction of requests using verifiable computation, achieving strong detection guarantees while amortizing cryptographic overhead.
arXiv Detail & Related papers (2026-02-26T07:21:02Z)
Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.<n>We present FactArena, a fully automated arena-style evaluation framework.<n>Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z)
ORCHID: Orchestrated Retrieval-Augmented Classification with Human-in-the-Loop Intelligent Decision-Making for High-Risk Property [6.643427585499247]
ORCHID is a modular agentic system for HRP classification.<n>It pairs retrieval-augmented generation (RAG) with human oversight to produce policy-based outputs that can be audited.<n>The demonstration shows single item submission, grounded citations, SME feedback capture, and exportable audit artifacts.
arXiv Detail & Related papers (2025-11-07T03:48:05Z)
"Show Me You Comply... Without Showing Me Anything": Zero-Knowledge Software Auditing for AI-Enabled Systems [2.2981698355892686]
This paper introduces ZKMLOps, a novel MLOps verification framework.<n>It operationalizes Zero-Knowledge Proofs (ZKPs)-cryptographic protocols allowing a prover to convince a verifier that a statement is true.<n>We evaluate the framework's practicality through a study of regulatory compliance in financial risk auditing.
arXiv Detail & Related papers (2025-10-30T15:03:32Z)
Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation [55.47971671635531]
Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA)<n>Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge.<n>Existing systems primarily rely on unstructured documents, while largely overlooking relational databases.
arXiv Detail & Related papers (2025-09-30T22:19:44Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services [22.700907666937177]
This position paper highlights emerging accountability challenges in commercial Opaque LLM Services (COLS)<n>We formalize two key risks: textitquantity inflation, where token and call counts may be artificially inflated, and textitquality downgrade, where providers might quietly substitute lower-cost models or tools.<n>We propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals.
arXiv Detail & Related papers (2025-05-24T02:26:49Z)
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem.<n>Users pay for specific models but have no guarantee that providers deliver them faithfully.<n>We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions.<n>We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Pragmatic auditing: a pilot-driven approach for auditing Machine Learning systems [5.26895401335509]
We present a respective procedure that extends the AI-HLEG guidelines published by the European Commission. Our audit procedure is based on an ML lifecycle model that explicitly focuses on documentation, accountability, and quality assurance. We describe two pilots conducted on real-world use cases from two different organisations.
arXiv Detail & Related papers (2024-05-21T20:40:37Z)
OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs [59.836774258359945]
OpenFactCheck is a framework for building customized automatic fact-checking systems.<n>It allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims.<n>CheckerEVAL is a solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets.
arXiv Detail & Related papers (2024-05-09T07:15:19Z)
Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples [70.84093873437425]
This paper introduces the Accountable Offline Controller (AOC) that employs the offline dataset as the Decision Corpus. AOC operates effectively in low-data scenarios, can be extended to the strictly offline imitation setting, and displays qualities of both conservation and adaptability. We assess AOC's performance in both simulated and real-world healthcare scenarios, emphasizing its capability to manage offline control tasks with high levels of performance while maintaining accountability.
arXiv Detail & Related papers (2023-10-11T17:20:32Z)
Multi-view Contrastive Self-Supervised Learning of Accounting Data Representations for Downstream Audit Tasks [1.9659095632676094]
International audit standards require the direct assessment of a financial statement's underlying accounting transactions, referred to as journal entries. Deep learning inspired audit techniques have emerged in the field of auditing vast quantities of journal entry data. We propose a contrastive self-supervised learning framework designed to learn audit task invariant accounting data representations.
arXiv Detail & Related papers (2021-09-23T08:16:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.