Related papers: Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

URL: http://arxiv.org/abs/2512.23848v1
Date: Mon, 29 Dec 2025 20:24:15 GMT
Title: Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs
Authors: Yukun Zhang, Stefan Elbl Droguett, Samyak Jain
Abstract summary: We implement a multi-retriever Retrieval-Augmented Generation system to retrieve both external domain knowledge and internal question contexts. We find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model.
Score: 13.368251290146794
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This research project addresses errors in financial numerical reasoning Question Answering (QA) tasks that stem from a lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model, which surpasses the FinQA paper's top model that serves as our baseline. This suggests that domain-specific training can yield superior performance. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with a significant improvement (>7%), yet it remains below human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.

Related papers

Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling [21.45871501724415]
We propose a data-driven approach to automatically construct highly granular reasoning model error. Rubrics can be used to build stronger LLM-as-judge reward functions. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels.
arXiv Detail & Related papers (2026-02-06T15:51:52Z)
Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective [53.594353527056775]
We propose Chinese Commonsense Multi-hop Reasoning (CCMOR) to evaluate Large Language Models (LLMs). CCMOR is designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. We implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions.
arXiv Detail & Related papers (2025-10-09T20:29:00Z)
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z)
FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning [29.526711154687945]
FinEval-KR is a novel evaluation framework for quantifying large language models' knowledge and reasoning abilities. Inspired by cognitive science, we propose a cognitive score to analyze capabilities in reasoning tasks across different cognitive levels. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy.
arXiv Detail & Related papers (2025-06-18T06:21:50Z)
General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
Bridging Language Models and Financial Analysis [49.361943182322385]
The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry.
arXiv Detail & Related papers (2025-03-14T01:35:20Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge [5.279257531335345]
This paper details the development of a high-performance, finance-specific Retrieval-Augmented Generation (RAG) system for the ACM-ICAIF '24 FinanceRAG competition. We optimized performance through ablation studies on query expansion and corpus refinement during the pre-retrieval phase. Notably, we introduced an efficient method for managing long context sizes during the generation phase, significantly improving response quality without sacrificing performance.
arXiv Detail & Related papers (2024-11-23T09:56:21Z)
Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance [0.32985979395737774]
We present a detailed analysis of fine-tuning large language models (LLMs) for domain-specific tasks. We find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results.
arXiv Detail & Related papers (2024-10-01T22:35:56Z)
Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift. We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.