Related papers: ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

URL: http://arxiv.org/abs/2505.16566v1
Date: Thu, 22 May 2025 11:59:06 GMT
Title: ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
Authors: Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park
Abstract summary: ScholarBench is a benchmark for evaluating the academic reasoning ability of large language models (LLMs). The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543.
Score: 13.79519099452634
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

Related papers

TASE: Token Awareness and Structured Evaluation for Multilingual Language Models [8.058965963418785]
TASE is a benchmark designed to evaluate large language models' ability to perceive and reason about token-level information. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1.
arXiv Detail & Related papers (2025-08-07T15:11:17Z)
OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases [38.58409057214189]
OneEval is a benchmark to assess the knowledge-intensive reasoning capabilities of Large Language Models (LLMs). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval_Hard, consisting of 1,285 particularly difficult cases. We release the OneEval datasets, evaluation scripts, and baseline results publicly, accompanied by a leaderboard to facilitate ongoing advancements in structured knowledge reasoning.
arXiv Detail & Related papers (2025-06-14T17:16:05Z)
Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages [17.028968054304947]
MSumBench is a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages.
arXiv Detail & Related papers (2025-05-31T13:12:35Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z)
EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs). It determines whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context [0.4194295877935868]
We present the L3Cube-IndicQuest, a gold-standard factual question-answering benchmark dataset. The dataset contains 200 question-answer pairs, each for English and 19 Indic languages, covering five domains specific to the Indic region.
arXiv Detail & Related papers (2024-09-13T10:48:35Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? [20.476500441734427]
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks. Their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.
arXiv Detail & Related papers (2023-09-14T06:41:58Z)
On the Evaluation of Neural Code Translation: Taxonomy and Benchmark [12.431884660186281]
We develop a taxonomy that categorizes code translation tasks into four primary types according to their complexity and knowledge dependence. We then conduct a thorough analysis of how existing approaches perform across these four categories. Our findings indicate that while state-of-the-art code translation models excel in type-1 and type-2 translations, they struggle with knowledge-dependent ones such as type-3 and type-4.
arXiv Detail & Related papers (2023-08-17T13:05:27Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases [53.054598423181844]
We present a novel NER cascade approach comprising three steps. We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities. Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
arXiv Detail & Related papers (2023-04-20T20:30:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.