MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
- URL: http://arxiv.org/abs/2406.01574v6
- Date: Wed, 06 Nov 2024 02:54:00 GMT
- Title: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
- Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen
- Abstract summary: This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark.
With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro.
Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
- Score: 44.840266648465054
- Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy of 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro than with direct answering, in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
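To make the prompt-robustness figures above concrete, the sketch below shows one way to estimate prompt sensitivity: score the same model under each prompt style and report the spread of accuracies. This is an illustrative reconstruction, not the authors' evaluation code; `model`, `questions`, and `prompt_styles` are hypothetical inputs (a callable returning an option letter, a list of question records with a gold answer letter, and a list of formatting functions for the prompt styles).

```python
# Minimal sketch (not the authors' code): estimate prompt sensitivity as the
# spread of a model's accuracy across prompt styles.
# Hypothetical inputs (not from the paper):
#   model(prompt)  -> predicted option letter, e.g. "A".."J" for 10 options
#   questions      -> list of dicts with "answer" holding the gold letter
#   prompt_styles  -> list of functions mapping a question dict to a prompt
from statistics import mean


def accuracy(model, questions, format_prompt):
    """Fraction of questions answered correctly under one prompt style."""
    correct = sum(model(format_prompt(q)) == q["answer"] for q in questions)
    return correct / len(questions)


def prompt_sensitivity(model, questions, prompt_styles):
    """Return (accuracy spread, mean accuracy) across prompt styles.

    A smaller spread means more stable scores. The max-minus-min spread is
    one natural reading of the 4-5% vs. 2% sensitivity figures quoted in the
    abstract; the paper may use a different statistic.
    """
    scores = [accuracy(model, questions, style) for style in prompt_styles]
    return max(scores) - min(scores), mean(scores)
```

Reporting a spread across prompt styles, rather than a single accuracy, is what enables the comparison in the abstract: under this kind of measure the variation narrows from roughly 4-5 points on MMLU to about 2 points on MMLU-Pro.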
Related papers
- Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models [95.34001906930152]
Large Language Models (LLMs) have the potential to transform online shopping by alleviating task-specific engineering efforts.
We propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data.
Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality.
arXiv Detail & Related papers (2024-10-28T05:25:47Z) - MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark [77.93283927871758]
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark.
MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T15:31:26Z) - MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs [2.3603377248944017]
Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models.
We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning.
Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination.
arXiv Detail & Related papers (2024-09-03T19:31:03Z) - MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z) - MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models [8.7734602595507]
We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs).
We modified standardized test questions by replacing a key term with a dummy word along with its definition (a sketch of this substitution appears after this list).
We found a substantial reduction in model performance after such replacement, suggesting poor comprehension.
arXiv Detail & Related papers (2024-06-15T05:35:47Z) - Are We Done with MMLU? [18.740187299563473]
We identify and analyse errors in the popular Massive Multitask Language Understanding benchmark.
For example, we find that 57% of the analysed questions in the Virology subset contain errors.
We create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
arXiv Detail & Related papers (2024-06-06T14:49:06Z) - An Improved Traditional Chinese Evaluation Suite for Foundation Model [15.669799471464676]
We present TMMLU+, a new benchmark designed for Traditional Chinese language understanding.
It is a multi-choice question-answering dataset with 66 subjects from elementary to professional level.
We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs) of parameters ranging from 1.8B to 72B.
arXiv Detail & Related papers (2024-03-04T09:13:33Z) - MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI [64.21953221846596]
MMMU is a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks.
Questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types.
The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU.
arXiv Detail & Related papers (2023-11-27T17:33:21Z) - An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models [116.50367506746713]
We present an empirical study of scaling LLaVA up to 33B and 65B/70B.
We find that scaling LMMs consistently enhances model performance and improves language capabilities.
We hope that this study makes state-of-the-art LMM research at a larger scale more accessible.
arXiv Detail & Related papers (2023-09-18T17:30:46Z)
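The MMLU-SR entry above describes replacing a key term in each question with a dummy word accompanied by the term's definition. Below is a minimal, hypothetical sketch of that kind of substitution, not the MMLU-SR authors' code; the dummy word, example question, and definition are invented for illustration.

```python
# Minimal, hypothetical sketch of the MMLU-SR-style substitution described
# above: swap a key term for a dummy word and supply the term's definition,
# so the model must reason from the definition rather than from memorized
# associations with the original term. All names below are invented.
def replace_key_term(question: str, key_term: str, definition: str,
                     dummy: str = "zorb") -> str:
    """Replace `key_term` with `dummy` and prepend a definition of the dummy word."""
    rewritten = question.replace(key_term, dummy)
    return f'Suppose "{dummy}" means: {definition}.\n{rewritten}'


# Illustrative usage with an invented question:
if __name__ == "__main__":
    q = "What is the pH of a neutral aqueous solution at 25 degrees Celsius?"
    print(replace_key_term(q, "pH", "the negative logarithm of hydrogen ion activity"))
```

Because the dummy word carries no prior associations, a model can only answer correctly by actually using the supplied definition, which is the comprehension test the MMLU-SR entry describes.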