Related papers: FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

URL: http://arxiv.org/abs/2404.18359v1
Date: Mon, 29 Apr 2024 01:49:07 GMT
Title: FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models
Authors: Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He,
Abstract summary: This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
Score: 64.11333762954283
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.

Related papers

R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation [75.33671166231096]
We introduce a graduate-level, multi-disciplinary, EnglishChinese benchmark, dubbed as Reasoning Bench (R-Bench)<n>RBench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing.<n>We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc.
arXiv Detail & Related papers (2025-05-04T07:48:36Z)
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models [50.43793764203352]
We introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels and cross-year partitions.
arXiv Detail & Related papers (2025-04-08T08:06:53Z)
ChineseSimpleVQA -- "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models [38.921977141721605]
We introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA. Key features of this benchmark include a focus on the Chinese language, diverse knowledge types, a multi-hop question construction, high-quality data, static consistency, and easy-to-evaluate through short answers.
arXiv Detail & Related papers (2025-02-17T12:02:23Z)
Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models [9.761584874383873]
We present Edu-Values, the first Chinese education values evaluation benchmark designed to measure large language models' alignment ability. We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture. Due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking the first with a score of 81.37.
arXiv Detail & Related papers (2024-09-19T13:02:54Z)
MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
This study introduces MultiPragEval, the first pragmatic evaluation of Large Language Models (LLMs) Comprising 1200 question units categorized according to Grice's Cooperative Principle, MultiPragEval enables an in-depth assessment of LLMs' contextual awareness and their ability to infer implied meanings. Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field.
arXiv Detail & Related papers (2024-06-11T21:46:03Z)
Large Language Models are Limited in Out-of-Context Knowledge Reasoning [65.72847298578071]
Large Language Models (LLMs) possess extensive knowledge and strong capabilities in performing in-context reasoning. This paper focuses on a significant aspect of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which is to combine multiple knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z)
Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suit tailored for assessing the advanced knowledge and reasoning capability in large language models (LLMs) TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
EpiK-Eval: Evaluation for Language Models as Epistemic Models [16.485951373967502]
We introduce EpiK-Eval, a novel question-answering benchmark tailored to evaluate LLMs' proficiency in formulating a coherent and consistent knowledge representation from segmented narratives. We argue that these shortcomings stem from the intrinsic nature of prevailing training objectives. The findings from this study offer insights for developing more robust and reliable LLMs.
arXiv Detail & Related papers (2023-10-23T21:15:54Z)
KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA) We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks. We use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
A Survey of Knowledge-Intensive NLP with Pre-Trained Language Models [185.08295787309544]
We aim to summarize the current progress of pre-trained language model-based knowledge-enhanced models (PLMKEs) We present the challenges of PLMKEs based on the discussion regarding the three elements and attempt to provide NLP practitioners with potential directions for further research.
arXiv Detail & Related papers (2022-02-17T17:17:43Z)
Unsupervised Commonsense Question Answering with Self-Talk [71.63983121558843]
We propose an unsupervised framework based on self-talk as a novel alternative to commonsense tasks. Inspired by inquiry-based discovery learning, our approach inquires language models with a number of information seeking questions. Empirical results demonstrate that the self-talk procedure substantially improves the performance of zero-shot language model baselines.
arXiv Detail & Related papers (2020-04-11T20:43:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.