LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows
- URL: http://arxiv.org/abs/2511.07585v1
- Date: Wed, 12 Nov 2025 01:05:38 GMT
- Title: LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows
- Authors: Raffi Khatchadourian, Rolando Franco
- Abstract summary: Nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures on regulated financial tasks. This finding challenges conventional assumptions that larger models are universally superior for production deployment.
- Score: 0.5798758080057375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models (Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency (95% CI: 3.5-36.0%) regardless of configuration (p<0.0001, Fisher's exact test). This finding challenges conventional assumptions that larger models are universally superior for production deployment. Our contributions include: (i) a finance-calibrated deterministic test harness combining greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering; (ii) task-specific invariant checking for RAG, JSON, and SQL outputs using finance-calibrated materiality thresholds (plus or minus 5%) and SEC citation validation; (iii) a three-tier model classification system enabling risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. We evaluated five models (Qwen2.5-7B via Ollama, Granite-3-8B via IBM watsonx.ai, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B) across three regulated financial tasks. Across 480 runs (n=16 per condition), structured tasks (SQL) remain stable even at T=0.2, while RAG tasks show drift (25-75%), revealing task-dependent sensitivity. Cross-provider validation confirms deterministic behavior transfers between local and cloud deployments. We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
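The paper's test harness is not reproduced here, but its core measurement, repeating a greedy-decoded run and scoring how often outputs agree, plus the ±5% materiality invariant from the abstract, can be sketched as follows. The function names (`normalize`, `consistency_rate`, `within_materiality`) and the whitespace normalization are illustrative assumptions, not the authors' implementation.

```python
import hashlib
from collections import Counter

def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting noise does not count as drift."""
    return " ".join(text.split())

def consistency_rate(outputs: list[str]) -> float:
    """Fraction of runs whose normalized output matches the modal (most frequent) output."""
    digests = [hashlib.sha256(normalize(o).encode()).hexdigest() for o in outputs]
    _, modal_count = Counter(digests).most_common(1)[0]
    return modal_count / len(digests)

def within_materiality(reference: float, candidate: float, threshold: float = 0.05) -> bool:
    """Finance-style materiality invariant: candidate figure within +/-5% of reference."""
    if reference == 0.0:
        return candidate == 0.0
    return abs(candidate - reference) / abs(reference) <= threshold

# Illustrative run set mirroring the paper's n=16 repeats per condition:
# 14 identical outputs and 2 divergent ones give 87.5% consistency.
runs = ["Net revenue rose 4.1%"] * 14 + ["Net revenue rose 4.2%"] * 2
print(round(consistency_rate(runs), 4))   # 0.875
print(within_materiality(100.0, 104.0))   # True  (4% deviation, within threshold)
print(within_materiality(100.0, 106.0))   # False (6% deviation, material)
```

A model classed as fully deterministic under this scheme would score 1.0 across all repeats at T=0.0; the harness described in the abstract layers fixed seeds and retrieval ordering on top so that any residual disagreement is attributable to the model rather than the pipeline.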
Related papers
- Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents [0.7699235580548228]
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged-transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services.
arXiv Detail & Related papers (2026-01-17T19:47:55Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics (cross-validated R2=0), enabling prediction on unseen task domains. Among the effects we identify: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Financial Instruction Following Evaluation (FIFE) [4.4409035166872135]
We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
arXiv Detail & Related papers (2025-12-01T00:39:19Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs [50.54173262572369]
Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision.
arXiv Detail & Related papers (2025-10-11T14:00:39Z) - Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data [12.136710894967088]
We introduce Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs. Empirically, TACs significantly outperform state-of-the-art prompt-optimization baselines.
arXiv Detail & Related papers (2025-08-25T17:36:21Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports [0.0]
We propose a fine-tuned vision-language model (VLM) based on Qwen2.5-VL-7B. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% markdown TEDS score.
arXiv Detail & Related papers (2025-08-04T04:54:00Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models [13.567516575993546]
We propose DianJin-R1, a reasoning-enhanced framework for large language models (LLMs) in the financial domain. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC). Our models, DianJin-R1-7B and DianJin-R1-32B, are fine-tuned from Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct using a structured format that generates both reasoning steps and final answers.
arXiv Detail & Related papers (2025-04-22T09:01:04Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z) - FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models [41.94295877935867]
KodeXv0.1 is a family of large language models that outclass GPT-4 in financial question answering.
We process a large number of publicly available financial documents such as earnings calls and business reports.
arXiv Detail & Related papers (2024-09-13T16:43:08Z) - Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.