Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge
Evaluation
- URL: http://arxiv.org/abs/2306.05783v3
- Date: Mon, 11 Mar 2024 09:49:04 GMT
- Title: Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge
Evaluation
- Authors: Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin
Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu,
Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei
Feng, Yanghua Xiao
- Abstract summary: We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge.
Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, and is accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions.
- Score: 61.56563631219381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New Natural Language Processing (NLP) benchmarks are urgently needed to
keep pace with the rapid development of large language models (LLMs). We present
Xiezhi, the most comprehensive evaluation suite designed to assess holistic
domain knowledge. Xiezhi comprises 249,587 multiple-choice questions spanning
516 diverse disciplines across 13 subjects, and is accompanied by
Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions. We conduct
an evaluation of 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs
exceed the average performance of humans in science, engineering, agronomy,
medicine, and art, but fall short in economics, jurisprudence, pedagogy,
literature, history, and management. We anticipate that Xiezhi will help analyze
important strengths and shortcomings of LLMs, and the benchmark is released at
https://github.com/MikeGu721/XiezhiBenchmark.
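A minimal sketch of how accuracy on a multiple-choice benchmark like Xiezhi might be scored is shown below. The field names ("question", "options", "answer") and the JSONL layout are illustrative assumptions, not the official Xiezhi data format or evaluation harness.

```python
# Hedged sketch: score a model on multiple-choice questions by exact match of
# the chosen option letter against the gold answer. Field names are assumed.
import json
from typing import Callable

def load_questions(path: str) -> list[dict]:
    """Load one JSON object per line, e.g. {"question": ..., "options": [...], "answer": "A"}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def accuracy(questions: list[dict], choose: Callable[[str, list[str]], str]) -> float:
    """Fraction of items where the model's chosen option letter matches the gold answer."""
    correct = 0
    for item in questions:
        prediction = choose(item["question"], item["options"])
        correct += prediction == item["answer"]
    return correct / len(questions)

# Example usage with a trivial baseline that always answers "A"
# (file name is hypothetical):
# qs = load_questions("xiezhi_specialty.jsonl")
# print(f"accuracy: {accuracy(qs, lambda q, opts: 'A'):.3f}")
```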
Related papers
- CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z)
- MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [32.811840681428464]
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks.
We present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning.
The best results observed include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
arXiv Detail & Related papers (2024-09-10T01:20:26Z)
- LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models [46.77647640464652]
Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications.
We propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark.
It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams.
arXiv Detail & Related papers (2024-03-19T10:11:14Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release [13.603414598813938]
DomMa aims to test Large Language Models (LLMs) on their understanding of domain knowledge.
It features extensive domain coverage, a large data volume, and a continually updated data set based on China's 112 first-level subject classifications.
arXiv Detail & Related papers (2023-04-23T15:11:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.