Related papers: Expect the Unexpected: FailSafe Long Context QA for Finance

Expect the Unexpected: FailSafe Long Context QA for Finance

URL: http://arxiv.org/abs/2502.06329v1
Date: Mon, 10 Feb 2025 10:29:28 GMT
Title: Expect the Unexpected: FailSafe Long Context QA for Finance
Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh,
Abstract summary: FailSafeQA is designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in finance.<n>We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA

Related papers

VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering [53.662676566188175]
A key bottleneck lies in the lack of public, large-scale, high-quality Scientific Visual Question Answering (SVQA) datasets.<n>We propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context.<n>We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types.
arXiv Detail & Related papers (2025-11-25T04:14:52Z)
SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models [17.805673311465295]
Uncertainty quantification (UQ) provides measures of uncertainty.<n>Black-box UQ methods do not require access to internal model information.<n>We propose a high-level non-verbalized similarity-based aggregation framework.
arXiv Detail & Related papers (2025-10-10T17:22:53Z)
XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning [28.967959142733903]
We introduce XFinBench, a novel benchmark to evaluate large language models' ability in solving financial problems.<n>O1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts with 12.5%.<n>We construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that relevant knowledge only brings consistent accuracy improvements to small open-source model.
arXiv Detail & Related papers (2025-08-20T15:23:35Z)
CountQA: How Well Do MLLMs Count in the Wild? [1.0799568216202955]
Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes.<n>They exhibit a critical lack in a fundamental cognitive skill: object counting.<n>This blind spot severely limits their reliability in real-world applications.<n>We introduce CountQA, a challenging new benchmark designed to probe this deficiency.
arXiv Detail & Related papers (2025-08-08T04:23:04Z)
A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis [14.754785659805869]
Large language models (LLMs) have exhibited remarkable capabilities across diverse open-domain tasks, yet their application in specialized domains such as civil engineering remains largely unexplored.<n>This paper starts bridging this gap by evaluating and enhancing the reliability and robustness of LLMs in structural analysis of beams.<n> Experimental results demonstrate that the agent achieves accuracy exceeding 99.0% on the benchmark dataset, exhibiting reliable and robust performance across diverse conditions.
arXiv Detail & Related papers (2025-06-27T04:16:53Z)
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge. Users pay for services based on advertised model capabilities. providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework [23.42251949130555]
Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA) Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance. We propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework.
arXiv Detail & Related papers (2025-03-11T11:18:53Z)
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models [4.875712300661656]
We present SCORE ($mathbfS$ystematic $mathbfCO$nsistency and $mathbfR$obustness $mathbfE$valuation), a comprehensive framework for non-adversarial evaluation of Large Language Models. The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency.
arXiv Detail & Related papers (2025-02-28T19:27:29Z)
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency [66.96286531087549]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches.<n>We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods.<n>We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models [0.0]
FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work.<n>Current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks.<n>Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI's fine-tuning API.
arXiv Detail & Related papers (2025-01-30T00:06:55Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304]
We study the ability of an MLLM model to produce semantically similar or identical responses to semantically similar queries. We propose the MM-R$3$ benchmark, which analyses the performance in terms of consistency and accuracy in SoTA MLLMs. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
arXiv Detail & Related papers (2024-10-07T06:36:55Z)
Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to textitfine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs. This raises critical concerns about the robustness and reliability of Tabular LLMs. This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z)
Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis [7.486549276995143]
Large Language Models (LLMs) have been shown to tackle table comprehension tasks without specific training.<n>We probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA)<n>We reveal a strong correlation between perturbation-induced shifts in attention dispersion and the drops in performance.
arXiv Detail & Related papers (2024-06-18T15:41:15Z)
Test-Time Fairness and Robustness in Large Language Models [17.758735680493917]
Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs. Existing solutions, which instruct the LLM to be fair or robust, rely on the model's implicit understanding of bias. We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs.
arXiv Detail & Related papers (2024-06-11T20:05:15Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability. In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling. Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP. We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts. We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
Feature Selection with Annealing for Forecasting Financial Time Series [2.44755919161855]
This study provides a comprehensive method for forecasting financial time series based on tactical input output feature mapping techniques using machine learning (ML) models. Experiments indicate that the FSA algorithm increased the performance of ML models, regardless of problem type.
arXiv Detail & Related papers (2023-03-03T21:33:38Z)
Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics. We validate the performance of the QA models trained with our word embedding on a single source dataset, on five different target domains. Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.