Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2508.15202v1
- Date: Thu, 21 Aug 2025 03:31:11 GMT
- Title: Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
- Authors: Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
- Abstract summary: Fin-PRM is a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. It integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We show that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality.
- Score: 12.415988471162997
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce \textbf{Fin-PRM}, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9\% in supervised learning, 5.2\% in reinforcement learning, and 5.1\% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
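The abstract's mechanism can be made concrete with a short sketch. The Python below is a minimal illustration, not the authors' released code: the `Trajectory` type, the scorer interfaces, and the mixing weight `alpha` are assumptions for exposition. It shows how step-level and trajectory-level rewards might be blended into one scalar and then used for reward-informed Best-of-N selection (application iii):

```python
# Hypothetical sketch of a trajectory-aware PRM applied to Best-of-N selection.
# All names and interfaces below are illustrative assumptions, not Fin-PRM's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    steps: List[str]   # intermediate reasoning steps
    answer: str        # final answer

def combined_reward(
    traj: Trajectory,
    step_scorer: Callable[[str], float],               # step-level reward in [0, 1]
    trajectory_scorer: Callable[[Trajectory], float],  # trajectory-level reward
    alpha: float = 0.5,                                # assumed mixing weight
) -> float:
    """Blend step-level and trajectory-level supervision into one scalar."""
    step_score = sum(step_scorer(s) for s in traj.steps) / max(len(traj.steps), 1)
    return alpha * step_score + (1.0 - alpha) * trajectory_scorer(traj)

def best_of_n(
    candidates: List[Trajectory],
    step_scorer: Callable[[str], float],
    trajectory_scorer: Callable[[Trajectory], float],
) -> Trajectory:
    """Return the candidate trajectory with the highest combined PRM reward."""
    return max(candidates, key=lambda t: combined_reward(t, step_scorer, trajectory_scorer))
```

Read this way, the same combined score could also rank trajectories when selecting distillation data for supervised fine-tuning (application i), and the per-step scores could be emitted directly as dense rewards during reinforcement learning (application ii).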
Related papers
- $φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models [58.217707070069885]
This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO) framework for continual learning in LMMs. We first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Extensive experiments and ablation studies show the proposed $φ$-DPO achieves state-of-the-art performance across multiple benchmarks.
arXiv Detail & Related papers (2026-02-26T04:14:33Z) - FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z) - A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs [8.842756364986704]
We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer.
arXiv Detail & Related papers (2025-09-10T09:40:18Z) - Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning [12.548390779247987]
We introduce the Agentar-Fin-R1 series of financial large language models. Our optimization approach integrates a high-quality, systematic financial task label system. Our models undergo comprehensive evaluation on mainstream financial benchmarks.
arXiv Detail & Related papers (2025-07-22T17:52:16Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM for evaluating trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning [82.7292329605713]
FinChain is the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. It spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces. FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
arXiv Detail & Related papers (2025-06-03T06:44:42Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models [13.567516575993546]
We propose DianJin-R1, a reasoning-enhanced framework for large language models (LLMs) in the financial domain. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC). Our models, DianJin-R1-7B and DianJin-R1-32B, are fine-tuned from Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct using a structured format that generates both reasoning steps and final answers.
arXiv Detail & Related papers (2025-04-22T09:01:04Z) - A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. Remaining shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance, necessitate advanced post-training language models (PoLMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; and Integration and Adaptation.
arXiv Detail & Related papers (2025-03-08T05:41:42Z) - Fino1: On the Transferability of Reasoning-Enhanced LLMs and Reinforcement Learning to Finance [35.617409883103335]
FinReason is the first financial reasoning benchmark covering multi-table analysis, long-context reasoning, and equation-based tasks. We introduce FinCoT, the first open high-fidelity CoT corpus for finance, distilled from seven QA datasets. We develop Fin-o1, the first open financial reasoning models trained via supervised fine-tuning and GRPO-based RL.
arXiv Detail & Related papers (2025-02-12T05:13:04Z) - Demystifying Domain-adaptive Post-training for Financial LLMs [79.581577578952]
FINDAP is a systematic and fine-grained investigation into domain-adaptive post-training of large language models (LLMs). Our approach consists of four key components: FinCap, FinRec, FinTrain and FinEval. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks.
arXiv Detail & Related papers (2025-01-09T04:26:15Z) - FinBen: A Holistic Financial Benchmark for Large Language Models [75.09474986283394]
FinBen is the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks.
FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading.
arXiv Detail & Related papers (2024-02-20T02:16:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.