Financial Instruction Following Evaluation (FIFE)
- URL: http://arxiv.org/abs/2512.08965v1
- Date: Mon, 01 Dec 2025 00:39:19 GMT
- Title: Financial Instruction Following Evaluation (FIFE)
- Authors: Glenn Matlin, Siddharth, Anirudh JM, Aditya Shukla, Yahya Hassan, Sudheer Chava
- Abstract summary: We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language Models (LMs) struggle with complex, interdependent instructions, particularly in high-stakes domains like finance where precision is critical. We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We evaluate 53 models (proprietary, open-weight, open-source) in a zero-shot setting. Our key findings reveal a clear performance hierarchy: the top open-weight model (76.1 strict / 79.5 loose) surpasses the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag significantly (45.5 strict / 48.9 loose). However, even top-performing models struggle with FIFE's complex requirements, failing to achieve perfect compliance. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
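The abstract's "chainable, verifiable constraints" with strict and loose compliance can be illustrated with a minimal sketch. This is a hypothetical illustration, not FIFE's actual implementation: the `Constraint` class, the example predicates, and the strict/loose split (all constraints pass vs. fraction passed) are assumptions about how such a verifier might be structured.

```python
# Hypothetical sketch of chainable, verifiable constraints for scoring a
# model response. Each constraint is a pure predicate over the response;
# per-constraint results give a fine-grained reward signal, the fraction
# passed gives a "loose" score, and "strict" requires all to pass.
# Names and structure are illustrative only, not FIFE's released code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Constraint:
    name: str
    check: Callable[[str], bool]  # verifiable predicate over response text

def evaluate(response: str, constraints: List[Constraint]) -> dict:
    results = {c.name: c.check(response) for c in constraints}
    passed = sum(results.values())
    return {
        "per_constraint": results,            # fine-grained reward signal
        "loose": passed / len(constraints),   # fraction of constraints met
        "strict": passed == len(constraints), # every constraint must hold
    }

# Example: a financial-analysis prompt with three verifiable requirements
# (the specific requirements below are invented for illustration).
constraints = [
    Constraint("mentions_ticker", lambda r: "AAPL" in r),
    Constraint("has_percentage", lambda r: "%" in r),
    Constraint("under_50_words", lambda r: len(r.split()) <= 50),
]

result = evaluate("AAPL revenue grew 8% year over year.", constraints)
print(result["loose"], result["strict"])  # → 1.0 True
```

Because each check is a deterministic predicate rather than a judge-model rating, the same response always receives the same score, which is what makes the signal usable as a reward in reinforcement learning.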
Related papers
- SSR: Safeguarding Staking Rewards by Defining and Detecting Logical Defects in DeFi Staking [55.62033436283969]
Decentralized Finance (DeFi) staking is one of the most prominent applications within the DeFi ecosystem. Logical defects in DeFi staking could enable attackers to claim unwarranted rewards. We developed SSR (Safeguarding Staking Reward), a static analysis tool designed to detect logical defects in DeFi staking contracts.
arXiv Detail & Related papers (2026-01-09T15:01:41Z) - VERAFI: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation [2.43679682660038]
VERAFI is an agentic framework with neurosymbolic policy generation for verified financial intelligence. VERAFI combines state-of-the-art dense retrieval and cross-encoder reranking with financial tool-enabled agents and automated reasoning policies.
arXiv Detail & Related papers (2025-12-12T17:17:43Z) - LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows [0.5798758080057375]
Nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures on regulated financial tasks. This finding challenges conventional assumptions that larger models are universally superior for production deployment.
arXiv Detail & Related papers (2025-11-10T19:54:00Z) - Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning [12.548390779247987]
We introduce the Agentar-Fin-R1 series of financial large language models. Our optimization approach integrates a high-quality, systematic financial task label system. Our models undergo comprehensive evaluation on mainstream financial benchmarks.
arXiv Detail & Related papers (2025-07-22T17:52:16Z) - One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models. We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z) - Expect the Unexpected: FailSafe Long Context QA for Finance [0.0]
FailSafeQA is designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in finance. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models.
arXiv Detail & Related papers (2025-02-10T10:29:28Z) - FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning [2.616867378362811]
FISHNET is an agentic architecture that accomplishes highly complex analytical tasks for more than 98,000 regulatory filings.
FISHNET shows remarkable performance for financial insight generation.
arXiv Detail & Related papers (2024-10-25T17:53:47Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)