Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge
Evaluation
- URL: http://arxiv.org/abs/2306.05783v3
- Date: Mon, 11 Mar 2024 09:49:04 GMT
- Title: Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge
Evaluation
- Authors: Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin
Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu,
Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei
Feng, Yanghua Xiao
- Abstract summary: We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge.
Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, and is accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions.
- Score: 61.56563631219381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New Natural Language Processing (NLP) benchmarks are urgently needed to align
with the rapid development of large language models (LLMs). We present Xiezhi,
the most comprehensive evaluation suite designed to assess holistic domain
knowledge. Xiezhi comprises 249,587 multiple-choice questions spanning 516
diverse disciplines across 13 different subjects, accompanied by
Xiezhi-Specialty and Xiezhi-Interdiscipline, both with 15k questions. We
conduct an evaluation of 47 cutting-edge LLMs on Xiezhi. Results indicate that
LLMs exceed the average performance of humans in science, engineering,
agronomy, medicine, and art, but fall short in economics, jurisprudence,
pedagogy, literature, history, and management. We anticipate Xiezhi will help
analyze important strengths and shortcomings of LLMs, and the benchmark is
released at https://github.com/MikeGu721/XiezhiBenchmark.
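For context, the sketch below shows how one might score a model on Xiezhi-style multiple-choice questions with exact-match accuracy. The record fields ("question", "options", "answer"), the sample item, and the stand-in model are assumptions for illustration only; the released repository at https://github.com/MikeGu721/XiezhiBenchmark defines the actual data format and official evaluation protocol.

```python
# Minimal sketch: exact-match accuracy over multiple-choice records.
# Field names ("question", "options", "answer") are assumed, not taken
# from the official Xiezhi release.
import random
from typing import Callable, Dict, List


def dummy_model(question: str, options: List[str]) -> str:
    """Stand-in for an LLM: picks a random option. Replace with a real model call."""
    return random.choice(options)


def evaluate(records: List[Dict], model: Callable[[str, List[str]], str]) -> float:
    """Return the fraction of records for which the model picks the correct option."""
    if not records:
        return 0.0
    correct = sum(
        1 for rec in records
        if model(rec["question"], rec["options"]) == rec["answer"]
    )
    return correct / len(records)


if __name__ == "__main__":
    # Tiny illustrative sample; the real Xiezhi splits hold thousands of items.
    sample = [
        {
            "question": "Which subject studies legal systems?",
            "options": ["Jurisprudence", "Agronomy", "Pedagogy", "Art"],
            "answer": "Jurisprudence",
        },
    ]
    print(f"Accuracy: {evaluate(sample, dummy_model):.2%}")
```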
Related papers
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [122.04298386571692]
Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science.
However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks.
We present SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
arXiv Detail & Related papers (2025-02-20T17:05:58Z)
- Humanity's Last Exam [253.45228996132735]
Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge.
It consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.
arXiv Detail & Related papers (2025-01-24T05:27:46Z)
- MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge [24.66666826440994]
MINTQA is a benchmark to evaluate large language models' capabilities in multi-hop reasoning.
MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge.
Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries.
arXiv Detail & Related papers (2024-12-22T14:17:12Z)
- CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate LLMs on complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z)
- VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [20.56989082014445]
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks.
We present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning.
The best performances observed include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
arXiv Detail & Related papers (2024-09-10T01:20:26Z)
- LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models [46.77647640464652]
Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications.
We propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark.
It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams.
arXiv Detail & Related papers (2024-03-19T10:11:14Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release [13.603414598813938]
DomMa targets testing Large Language Models (LLMs) on their understanding of domain knowledge.
It features extensive domain coverage, large data volume, and a continually updated data set based on the 112 Chinese first-level subject classifications.
arXiv Detail & Related papers (2023-04-23T15:11:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.