ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
- URL: http://arxiv.org/abs/2510.18941v1
- Date: Tue, 21 Oct 2025 17:59:44 GMT
- Title: ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
- Authors: Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
- Abstract summary: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs.
- Score: 94.40918390309186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
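For readers who want to inspect the response-criterion pairs directly, here is a minimal sketch using the Hugging Face `datasets` library. The dataset id comes from the data link above; the split name and the column layout are assumptions, so check the dataset card before relying on them.

```python
# Minimal sketch: peek at the ProfBench data via the Hugging Face `datasets` library.
# The dataset id is taken from the paper's data link; the split name and the
# column names are assumptions, not documented fields.
from datasets import load_dataset

ds = load_dataset("nvidia/ProfBench", split="train")  # split name is an assumption

print(ds)  # prints the number of rows and the available columns
example = ds[0]
for key, value in example.items():
    preview = str(value)[:120]
    print(f"{key}: {preview}")  # preview one response-criterion record
```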
Related papers
- LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation [25.746010737879683]
Large Language Models (LLMs) have made rapid progress in reasoning, question answering, and professional applications. Current datasets often focus on simplified tasks or artificial scenarios, overlooking long-tail knowledge and the complexities of real-world applications. We propose LPFQA, a long-tail knowledge-based benchmark derived from authentic professional forums across 20 academic and industrial fields.
arXiv Detail & Related papers (2025-11-09T12:02:19Z)
- From Online User Feedback to Requirements: Evaluating Large Language Models for Classification and Specification Tasks [0.777471208829183]
Large language models (LLMs) show strong potential to automate the analysis of online user feedback. Existing studies offer limited empirical evidence, lack thorough evaluation, and rarely provide replication packages. We evaluate five lightweight open-source LLMs on three requirements engineering (RE) tasks.
arXiv Detail & Related papers (2025-10-27T06:33:01Z)
- "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations [1.1254231171451319]
This paper investigates whether Large Language Models (LLMs) can pass hiring evaluations. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to the expectation that LLMs would be ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions.
arXiv Detail & Related papers (2025-10-22T01:59:30Z)
- FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z)
- KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
- ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks [43.509761349059914]
ProBench is a benchmark of open-ended user queries that require professional expertise and advanced reasoning. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. ProBench presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning.
arXiv Detail & Related papers (2025-03-10T03:29:18Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)