ArcMMLU: A Library and Information Science Benchmark for Large Language
Models
- URL: http://arxiv.org/abs/2311.18658v1
- Date: Thu, 30 Nov 2023 16:08:04 GMT
- Title: ArcMMLU: A Library and Information Science Benchmark for Large Language
Models
- Authors: Shitou Zhang, Zuchao Li, Xingshen Liu, Liming Yang, Ping Wang
- Abstract summary: This paper introduces ArcMMLU, a benchmark tailored for the Library & Information Science (LIS) domain in Chinese.
This benchmark aims to measure the knowledge and reasoning capability of LLMs within four key sub-domains: Archival Science, Data Science, Library Science, and Information Science.
Our comprehensive evaluation reveals that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap.
- Score: 25.36473762494066
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In light of the rapidly evolving capabilities of large language models
(LLMs), it becomes imperative to develop rigorous domain-specific evaluation
benchmarks to accurately assess their capabilities. In response to this need,
this paper introduces ArcMMLU, a specialized benchmark tailored for the Library
& Information Science (LIS) domain in Chinese. This benchmark aims to measure
the knowledge and reasoning capability of LLMs within four key sub-domains:
Archival Science, Data Science, Library Science, and Information Science.
Following the format of MMLU/CMMLU, we collected over 6,000 high-quality
questions for the compilation of ArcMMLU. This extensive compilation can
reflect the diverse nature of the LIS domain and offer a robust foundation for
LLM evaluation. Our comprehensive evaluation reveals that while most mainstream
LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a
notable performance gap, suggesting substantial headroom for refinement in LLM
capabilities within the LIS domain. Further analysis explores the effectiveness
of few-shot examples on model performance and highlights challenging questions
where models consistently underperform, providing valuable insights for
targeted improvements. ArcMMLU fills a critical gap in LLM evaluations within
the Chinese LIS domain and paves the way for future development of LLMs
tailored to this specialized area.
Related papers
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z) - Linguistic Intelligence in Large Language Models for Telecommunications [5.06945923921948]
Large Language Models (LLMs) have emerged as a significant advancement in the field of Natural Language Processing (NLP)
This study seeks to evaluate the knowledge and understanding capabilities of LLMs within the telecommunications domain.
Our evaluation reveals that zero-shot LLMs can achieve performance levels comparable to the current state-of-the-art fine-tuned models.
arXiv Detail & Related papers (2024-02-24T14:01:07Z) - Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model which has far fewer parameters, and take its answers as answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Through the Lens of Core Competency: Survey on Evaluation of Large
Language Models [27.271533306818732]
Large language model (LLM) has excellent performance and wide practical uses.
Existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.