Expect the Unexpected: FailSafe Long Context QA for Finance
- URL: http://arxiv.org/abs/2502.06329v1
- Date: Mon, 10 Feb 2025 10:29:28 GMT
- Title: Expect the Unexpected: FailSafe Long Context QA for Finance
- Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh,
- Abstract summary: FailSafeQA is designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in finance.<n>We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA
Related papers
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge.
Users pay for services based on advertised model capabilities.
providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs.
This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework [23.42251949130555]
Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA)
Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance.
We propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework.
arXiv Detail & Related papers (2025-03-11T11:18:53Z) - SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models [4.875712300661656]
We present SCORE ($mathbfS$ystematic $mathbfCO$nsistency and $mathbfR$obustness $mathbfE$valuation), a comprehensive framework for non-adversarial evaluation of Large Language Models.
The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency.
arXiv Detail & Related papers (2025-02-28T19:27:29Z) - FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models [0.0]
FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work.<n>Current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks.<n>Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI's fine-tuning API.
arXiv Detail & Related papers (2025-01-30T00:06:55Z) - Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z) - MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304]
We study the ability of an MLLM model to produce semantically similar or identical responses to semantically similar queries.
We propose the MM-R$3$ benchmark, which analyses the performance in terms of consistency and accuracy in SoTA MLLMs.
Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
arXiv Detail & Related papers (2024-10-07T06:36:55Z) - Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to textitfine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z) - Test-Time Fairness and Robustness in Large Language Models [17.758735680493917]
Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs.
Existing solutions, which instruct the LLM to be fair or robust, rely on the model's implicit understanding of bias.
We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs.
arXiv Detail & Related papers (2024-06-11T20:05:15Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis,
and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z) - Feature Selection with Annealing for Forecasting Financial Time Series [2.44755919161855]
This study provides a comprehensive method for forecasting financial time series based on tactical input output feature mapping techniques using machine learning (ML) models.
Experiments indicate that the FSA algorithm increased the performance of ML models, regardless of problem type.
arXiv Detail & Related papers (2023-03-03T21:33:38Z) - Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics.
We validate the performance of the QA models trained with our word embedding on a single source dataset, on five different target domains.
Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.