A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs
- URL: http://arxiv.org/abs/2509.09727v1
- Date: Wed, 10 Sep 2025 09:40:18 GMT
- Title: A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs
- Authors: Andy Zhu, Yingjun Du
- Abstract summary: We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer.
- Score: 8.842756364986704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.
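The abstract describes a three-stage, single-pass pipeline: a Base Generator drafts an answer, an Evidence Retriever pulls supporting passages from finance textbooks via RAG, and an Expert Reviewer critiques and refines the draft. A minimal sketch of that flow is below; `call_llm` and `retrieve_evidence` are illustrative stand-ins (the paper's actual prompts, models, and retriever are not reproduced here), with a naive keyword-overlap retriever in place of a real RAG index.

```python
def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a chat-completion API call (e.g., GPT-4o-mini or Gemini-2.0-Flash)."""
    return f"[{system_prompt[:24]}...] answer to: {user_prompt[:48]}"

def retrieve_evidence(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval standing in for the textbook RAG step."""
    words = question.lower().split()
    scored = sorted(corpus, key=lambda p: -sum(w in p.lower() for w in words))
    return scored[:k]

def answer_financial_question(question: str, textbook_corpus: list[str]) -> str:
    # 1. Base Generator: zero-shot Chain-of-Thought draft answer.
    draft = call_llm("You are a finance tutor. Reason step by step.", question)
    # 2. Evidence Retriever: pull contextual passages from the textbook corpus.
    evidence = retrieve_evidence(question, textbook_corpus)
    # 3. Expert Reviewer: critique and refine the draft in a single pass.
    review_prompt = (
        f"Question: {question}\nDraft: {draft}\n"
        f"Evidence: {' | '.join(evidence)}\nRevise the draft if needed."
    )
    return call_llm("You are a domain-expert financial reviewer.", review_prompt)
```

The single-pass design (generate, retrieve, review once, rather than looping to convergence) is what keeps the approach cost-effective relative to iterative multi-agent debate.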
Related papers
- Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs [13.368251290146794]
We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts. We find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model.
arXiv Detail & Related papers (2025-12-29T20:24:15Z) - FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis [110.5695516127813]
HisRubric is a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric. FinDeepResearch is a benchmark that comprises 64 listed companies from 8 financial markets across 4 languages. We conduct extensive experiments on FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only.
arXiv Detail & Related papers (2025-10-15T17:21:56Z) - FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z) - Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models [12.415988471162997]
Fin-PRM is a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. It integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We show that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality.
arXiv Detail & Related papers (2025-08-21T03:31:11Z) - FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms. We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z) - Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III [0.0]
This paper presents a benchmark evaluating 23 state-of-the-art Large Language Models (LLMs) on the Chartered Financial Analyst (CFA) Level III exam. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III.
arXiv Detail & Related papers (2025-06-29T19:54:57Z) - FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation [65.04104723843264]
We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance. FinDER focuses on annotating search-relevant evidence by domain experts, offering 5,703 query-evidence-answer triplets. By challenging models to retrieve relevant information from large corpora, FinDER offers a more realistic benchmark for evaluating RAG systems.
arXiv Detail & Related papers (2025-04-22T11:30:13Z) - FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering [18.821122274064116]
We introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of large language models (LLMs) in answering complex reasoning questions that require advanced financial knowledge.
arXiv Detail & Related papers (2024-10-06T15:41:26Z) - Financial Knowledge Large Language Model [4.599537455808687]
We introduce IDEA-FinBench, an evaluation benchmark for assessing financial knowledge in large language models (LLMs).
We propose IDEA-FinKER, a framework designed to facilitate the rapid adaptation of general LLMs to the financial domain.
Finally, we present IDEA-FinQA, a financial question-answering system powered by LLMs.
arXiv Detail & Related papers (2024-06-29T08:26:49Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLMs) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.