Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation
- URL: http://arxiv.org/abs/2306.05783v3
- Date: Mon, 11 Mar 2024 09:49:04 GMT
- Title: Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation
- Authors: Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin
Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu,
Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei
Feng, Yanghua Xiao
- Abstract summary: We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge.
Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions.
- Score: 61.56563631219381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New Natural Language Processing (NLP) benchmarks are urgently
needed to keep pace with the rapid development of large language models
(LLMs). We present Xiezhi, the most comprehensive evaluation suite designed
to assess holistic domain knowledge. Xiezhi comprises 249,587
multiple-choice questions spanning 516 diverse disciplines across 13
subjects, accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each
with 15k questions. We evaluate 47 cutting-edge LLMs on Xiezhi. The results
indicate that LLMs exceed the average performance of humans in science,
engineering, agronomy, medicine, and art, but fall short in economics,
jurisprudence, pedagogy, literature, history, and management. We anticipate
that Xiezhi will help analyze important strengths and shortcomings of LLMs.
The benchmark is released at https://github.com/MikeGu721/XiezhiBenchmark.
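For readers unfamiliar with how multiple-choice benchmarks of this kind are scored, the sketch below outlines a minimal accuracy loop. It is illustrative only: the field names ("question", "options", "answer") and the answer-selection callback are assumptions, not the actual Xiezhi data format or official evaluation harness; consult the repository linked above for those.

```python
# Hypothetical sketch of a multiple-choice evaluation loop.
# Field names and the model callback are assumptions for illustration;
# the XiezhiBenchmark repository defines the real format and protocol.

def evaluate(model_answer_fn, questions):
    """Compute accuracy of a model over multiple-choice questions.

    model_answer_fn: callable taking a question string and a list of
        option strings, returning the index of the chosen option.
    questions: iterable of dicts with "question", "options", "answer" keys,
        where "answer" is the index of the correct option.
    """
    correct = 0
    total = 0
    for item in questions:
        choice = model_answer_fn(item["question"], item["options"])
        correct += int(choice == item["answer"])
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy example with a dummy model that always picks the first option.
    sample = [
        {"question": "2 + 2 = ?", "options": ["4", "5", "3", "22"], "answer": 0},
        {"question": "Capital of France?", "options": ["Lyon", "Paris"], "answer": 1},
    ]
    always_first = lambda q, opts: 0
    print(f"accuracy = {evaluate(always_first, sample):.2%}")  # 50.00%
```

In practice, per-question accuracies would be aggregated over the 516 disciplines to produce the subject-level comparisons against human averages reported in the abstract.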
Related papers
- LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs [8.89259409245068]
Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning.
We present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions.
We find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario.
arXiv Detail & Related papers (2024-06-07T18:21:26Z)
- LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models [46.77647640464652]
Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications.
We propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark.
It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams.
arXiv Detail & Related papers (2024-03-19T10:11:14Z)
- SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark [42.91902601376494]
The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level.
SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology.
It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities.
arXiv Detail & Related papers (2024-02-06T19:16:55Z)
- MULTI: Multimodal Understanding Leaderboard with Text and Images [24.580401463432075]
We present MULTI as a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images, and reasoning with long context.
MULTI includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning.
Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a 63.7% accuracy rate on MULTI, in contrast to other MLLMs scoring between 28.5% and 55.3%.
arXiv Detail & Related papers (2024-02-05T16:41:02Z)
- CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark [71.65760716397347]
We introduce a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context.
arXiv Detail & Related papers (2024-01-22T13:34:34Z) - MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI [64.21953221846596]
MMMU is a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks.
Questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types.
The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU.
arXiv Detail & Related papers (2023-11-27T17:33:21Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z) - CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release [13.603414598813938]
DomMa tests Large Language Models (LLMs) on their domain knowledge understanding.
It features extensive domain coverage, large data volume, and a continually updated dataset based on the 112 Chinese first-level subject classifications.
arXiv Detail & Related papers (2023-04-23T15:11:49Z)