ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
- URL: http://arxiv.org/abs/2510.18941v1
- Date: Tue, 21 Oct 2025 17:59:44 GMT
- Title: ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
- Authors: Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
- Abstract summary: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs.
- Score: 94.40918390309186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
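For readers who want to inspect the response-criterion pairs directly, here is a minimal sketch using the Hugging Face `datasets` library. The dataset id comes from the data link above; the split name and the column layout are assumptions, so check the dataset card before relying on them.

```python
# Minimal sketch: peek at the ProfBench data via the Hugging Face `datasets` library.
# The dataset id is taken from the paper's data link; the split name and the
# column names are assumptions, not documented fields.
from datasets import load_dataset

ds = load_dataset("nvidia/ProfBench", split="train")  # split name is an assumption

print(ds)  # prints the number of rows and the available columns
example = ds[0]
for key, value in example.items():
    preview = str(value)[:120]
    print(f"{key}: {preview}")  # preview one response-criterion record
```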
Related papers
- LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation [25.746010737879683]
Large Language Models (LLMs) have made rapid progress in reasoning, question answering, and professional applications. Current datasets often focus on simplified tasks or artificial scenarios, overlooking long-tail knowledge and the complexities of real-world applications. We propose LPFQA, a long-tail knowledge-based benchmark derived from authentic professional forums across 20 academic and industrial fields.
arXiv Detail & Related papers (2025-11-09T12:02:19Z)
- From Online User Feedback to Requirements: Evaluating Large Language Models for Classification and Specification Tasks [0.777471208829183]
Large language models (LLMs) show strong potential to automate the analysis of online user feedback. Existing studies offer limited empirical evidence, lack thorough evaluation, and rarely provide replication packages. We evaluate five lightweight open-source LLMs on three requirements engineering (RE) tasks.
arXiv Detail & Related papers (2025-10-27T06:33:01Z)
- "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations [1.1254231171451319]
This paper investigates whether Large Language Models (LLMs) can pass hiring evaluations. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to the expectation that LLMs would be ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions.
arXiv Detail & Related papers (2025-10-22T01:59:30Z)
- FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z)
- KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z)
- ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks [43.509761349059914]
ProBench is a benchmark of open-ended user queries that require professional expertise and advanced reasoning. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. ProBench presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning.
arXiv Detail & Related papers (2025-03-10T03:29:18Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)