PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
- URL: http://arxiv.org/abs/2511.11562v1
- Date: Fri, 14 Nov 2025 18:55:12 GMT
- Title: PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
- Authors: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, Yunzhong He
- Abstract summary: Professional Reasoning Bench (PRBench) is a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it the largest public, rubric-based benchmark for both legal and finance domains.
- Score: 18.32501228579171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both the legal and finance domains. We recruited 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the economic impacts associated with the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
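The abstract does not include PRBench's grader implementation. As a rough, hedged illustration of how rubric-based scoring of this kind is commonly aggregated, the sketch below turns per-criterion pass/fail judgments (from an expert or an LLM judge) into a task score and a subset average; the data classes, weighting scheme, and function names are assumptions for illustration, not PRBench's actual code.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical schema; PRBench's real task/rubric format is not specified in the abstract.
@dataclass
class Criterion:
    description: str          # one expert-curated rubric item
    weight: float = 1.0       # assumed: uniform weights unless experts specify otherwise
    satisfied: bool = False   # judgment from an expert reviewer or an LLM judge

@dataclass
class Task:
    prompt: str
    criteria: list[Criterion] = field(default_factory=list)

def task_score(task: Task) -> float:
    """Weighted fraction of rubric criteria satisfied by the model response (assumed aggregation)."""
    total = sum(c.weight for c in task.criteria)
    earned = sum(c.weight for c in task.criteria if c.satisfied)
    return earned / total if total else 0.0

def subset_score(tasks: list[Task]) -> float:
    """Mean task score over a subset (e.g., a Finance or Legal Hard split)."""
    return mean(task_score(t) for t in tasks) if tasks else 0.0

if __name__ == "__main__":
    t = Task(
        prompt="Draft a memo on cross-border data-transfer obligations for a US/EU client.",
        criteria=[
            Criterion("Identifies the governing legal framework", satisfied=True),
            Criterion("Flags jurisdiction-specific exceptions", satisfied=False),
            Criterion("States assumptions and limitations transparently", satisfied=True),
        ],
    )
    print(f"task score:   {task_score(t):.2f}")      # 0.67
    print(f"subset score: {subset_score([t]):.2f}")  # 0.67 with a single task
```

Under this reading, the reported 0.39 (Finance) and 0.37 (Legal) top scores would correspond to models satisfying well under half of the expert criteria on the Hard subsets, though the paper's exact aggregation may differ.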
Related papers
- Evaluating LLMs in Finance Requires Explicit Bias Consideration [88.38155218924999]
Finance-specific biases can inflate performance, contaminate backtests, and make reported results useless for deployment claims. No single bias is discussed in more than 28 percent of studies. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design.
arXiv Detail & Related papers (2026-02-15T17:02:01Z) - JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks [14.14645345504797]
We propose JADE, a two-layer evaluation framework for agentic AI. Layer 1 encodes expert knowledge as a predefined set of evaluation skills. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies.
arXiv Detail & Related papers (2026-02-06T08:26:09Z) - Benchmarking Agents in Insurance Underwriting Environments [0.9728664856449597]
Existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts.
arXiv Detail & Related papers (2026-01-31T02:12:11Z) - PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z) - BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment [12.163992099059461]
We introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both the Chinese and U.S. equity markets. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks, totaling 29,578 expert-level Q&A pairs. ChatGPT-5 achieves a prominent 61.5% accuracy on the main tasks, though a substantial gap relative to financial experts persists. In online tasks, DeepSeek-R1 outperforms all other commercial LLMs.
arXiv Detail & Related papers (2026-01-10T02:51:53Z) - DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports [49.217247659479476]
Deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis. Existing benchmarks often lack systematic criteria for expert reporting. We introduce DEER, a benchmark for evaluating expert-level deep research reports.
arXiv Detail & Related papers (2025-12-19T16:46:20Z) - ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [94.40918390309186]
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses. We introduce ProfBench: a set of over 7,000 response-criterion pairs evaluated by human experts. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-21T17:59:44Z) - Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study [1.6770212301915661]
This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of the CFA program. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized.
arXiv Detail & Related papers (2025-08-29T06:13:21Z) - Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z) - Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning [12.548390779247987]
We introduce the Agentar-Fin-R1 series of financial large language models. Our optimization approach integrates a high-quality, systematic financial task label system. Our models undergo comprehensive evaluation on mainstream financial benchmarks.
arXiv Detail & Related papers (2025-07-22T17:52:16Z) - CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions [85.88573535033406]
CRMArena-Pro is a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. It incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal that leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings.
arXiv Detail & Related papers (2025-05-24T21:33:22Z) - CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models [61.324062412648075]
CFinBench is an evaluation benchmark for assessing the financial knowledge of large language models (LLMs) in a Chinese context.
It comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment.
The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%.
arXiv Detail & Related papers (2024-07-02T14:34:36Z) - CSPRD: A Financial Policy Retrieval Dataset for Chinese Stock Market [61.59326951366202]
We propose a new task, policy retrieval, by introducing the Chinese Stock Policy Retrieval dataset (CSPRD).
CSPRD provides 700+ passages labeled by experienced experts with relevant articles from 10k+ entries in our collected Chinese policy corpus.
Our best-performing baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10, and 80.6% Precision@10 on the dev set; a short sketch of how such retrieval metrics are computed follows this entry.
arXiv Detail & Related papers (2023-09-08T15:40:54Z)
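For readers unfamiliar with the ranking metrics quoted for CSPRD above, the following is a minimal sketch of how MRR@10 and Recall@10 are conventionally computed; the toy data and function names are illustrative assumptions, not drawn from the CSPRD paper or its released code.

```python
def mrr_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top-k results per query."""
    scores = []
    for ranked_ids, rel_ids in zip(rankings, relevant):
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in rel_ids:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores) if scores else 0.0

def recall_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Fraction of a query's relevant passages found in the top-k results, averaged over queries."""
    scores = []
    for ranked_ids, rel_ids in zip(rankings, relevant):
        if not rel_ids:
            continue
        hits = len(set(ranked_ids[:k]) & rel_ids)
        scores.append(hits / len(rel_ids))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # Two toy queries: the relevant passage is ranked 2nd for the first query
    # and absent from the top-10 for the second.
    rankings = [["p7", "p3", "p9"], ["p1", "p2", "p4"]]
    relevant = [{"p3"}, {"p8"}]
    print(f"MRR@10:    {mrr_at_k(rankings, relevant):.3f}")     # (1/2 + 0) / 2 = 0.250
    print(f"Recall@10: {recall_at_k(rankings, relevant):.3f}")  # (1 + 0) / 2 = 0.500
```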