ArcMMLU: A Library and Information Science Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2311.18658v1
- Date: Thu, 30 Nov 2023 16:08:04 GMT
- Title: ArcMMLU: A Library and Information Science Benchmark for Large Language Models
- Authors: Shitou Zhang, Zuchao Li, Xingshen Liu, Liming Yang, Ping Wang
- Abstract summary: This paper introduces ArcMMLU, a benchmark tailored for the Library & Information Science (LIS) domain in Chinese.
This benchmark aims to measure the knowledge and reasoning capability of LLMs within four key sub-domains: Archival Science, Data Science, Library Science, and Information Science.
Our comprehensive evaluation reveals that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap.
- Score: 25.36473762494066
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In light of the rapidly evolving capabilities of large language models
(LLMs), it becomes imperative to develop rigorous domain-specific evaluation
benchmarks to accurately assess their capabilities. In response to this need,
this paper introduces ArcMMLU, a specialized benchmark tailored for the Library
& Information Science (LIS) domain in Chinese. This benchmark aims to measure
the knowledge and reasoning capability of LLMs within four key sub-domains:
Archival Science, Data Science, Library Science, and Information Science.
Following the format of MMLU/CMMLU, we collected over 6,000 high-quality
questions for the compilation of ArcMMLU. This extensive compilation reflects
the diverse nature of the LIS domain and offers a robust foundation for LLM
evaluation. Our comprehensive evaluation reveals that while most mainstream
LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a
notable performance gap, suggesting substantial headroom for refinement in LLM
capabilities within the LIS domain. Further analysis explores the effectiveness
of few-shot examples on model performance and highlights challenging questions
where models consistently underperform, providing valuable insights for
targeted improvements. ArcMMLU fills a critical gap in LLM evaluations within
the Chinese LIS domain and paves the way for future development of LLMs
tailored to this specialized area.
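Since ArcMMLU follows the MMLU/CMMLU multiple-choice format, the evaluation protocol the abstract describes (few-shot prompting over four-choice questions, scored by average accuracy) can be sketched roughly as below. This is a minimal illustration: the item schema (question, choices, answer) and the query_model callable are assumptions for the sake of the example, as the abstract does not specify how ArcMMLU is released.

```python
# Minimal sketch of an MMLU/CMMLU-style four-choice evaluation loop with
# few-shot exemplars drawn from a dev set. The item fields ("question",
# "choices", "answer") and query_model() are hypothetical placeholders.
from typing import Callable, Sequence

CHOICE_LABELS = ("A", "B", "C", "D")

def format_question(item: dict, include_answer: bool = False) -> str:
    """Render one question in the standard MMLU prompt layout."""
    lines = [item["question"]]
    for label, choice in zip(CHOICE_LABELS, item["choices"]):
        lines.append(f"{label}. {choice}")
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(dev_shots: Sequence[dict], test_item: dict) -> str:
    """Prepend k answered dev-set exemplars (few-shot) to the test question."""
    blocks = [format_question(ex, include_answer=True) for ex in dev_shots]
    blocks.append(format_question(test_item))
    return "\n\n".join(blocks)

def evaluate(test_set: Sequence[dict], dev_shots: Sequence[dict],
             query_model: Callable[[str], str]) -> float:
    """Average accuracy; query_model maps a prompt to a letter 'A'-'D'."""
    correct = sum(
        query_model(build_prompt(dev_shots, item))
        .strip().upper().startswith(item["answer"])
        for item in test_set
    )
    return correct / len(test_set)
```

Under this reading, zero-shot evaluation corresponds to an empty dev_shots list, and the paper's few-shot analysis would vary the number of exemplars prepended to each prompt.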
Related papers
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and that up to 50% of errors can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z) - FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z) - Linguistic Intelligence in Large Language Models for Telecommunications [5.06945923921948]
Large Language Models (LLMs) have emerged as a significant advancement in the field of Natural Language Processing (NLP).
This study seeks to evaluate the knowledge and understanding capabilities of LLMs within the telecommunications domain.
Our evaluation reveals that zero-shot LLMs can achieve performance levels comparable to the current state-of-the-art fine-tuned models.
arXiv Detail & Related papers (2024-02-24T14:01:07Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation ability by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.