LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
- URL: http://arxiv.org/abs/2510.16783v1
- Date: Sun, 19 Oct 2025 10:15:42 GMT
- Title: LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
- Authors: Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity
- Abstract summary: We present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of model performance across different text genres.
- Score: 0.4837072536850575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present \textbf{LC-Eval}, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
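The abstract does not describe a concrete data format or evaluation harness, so the following is only a minimal sketch of how one might score a model on an LC-Eval-style long-context multiple-choice task. The item schema (`context`, `question`, `choices`, `answer`), the JSONL file name, and the `generate` stub are hypothetical stand-ins, not part of the released benchmark.

```python
import json
from pathlib import Path


def load_items(path: str) -> list[dict]:
    """Load benchmark items; each line is assumed to be a JSON object with
    'context', 'question', 'choices', and 'answer' fields (hypothetical schema)."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def build_prompt(item: dict) -> str:
    """Concatenate the long context (4k to 128k+ tokens), the question, and lettered choices."""
    letters = "ABCD"
    choices = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item["choices"]))
    return (
        f"{item['context']}\n\n"
        f"Question: {item['question']}\n{choices}\n"
        "Answer with a single letter (A-D):"
    )


def generate(prompt: str) -> str:
    """Placeholder for a call to whichever open-weight or closed LLM is being evaluated."""
    raise NotImplementedError


def evaluate(path: str) -> float:
    """Return accuracy over the multiple-choice items in the given file."""
    items = load_items(path)
    correct = 0
    for item in items:
        reply = generate(build_prompt(item)).strip().upper()
        # Take the first A-D letter in the reply as the model's predicted choice.
        pred = next((ch for ch in reply if ch in "ABCD"), None)
        correct += int(pred == item["answer"])
    return correct / len(items)


# Example usage (hypothetical file name):
# accuracy = evaluate("lc_eval_mcq_en.jsonl")
```

The other LC-Eval tasks (multi-document QA, bilingual QA, claim verification) would need different prompts and scoring functions, but the same load-prompt-generate-score loop applies.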
Related papers
- NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models [7.134358758293254]
The Needle-in-a-Haystack benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). We demonstrate that even state-of-the-art models such as GPT-4o struggle to intactly incorporate a given context made up of only ten query-relevant sentences. We introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information.
arXiv Detail & Related papers (2025-07-30T06:29:50Z)
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [85.78821098963607]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
- On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation [12.848952248427977]
Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering tasks. In multilingual RAG, retrieved passages can be written in languages other than that of the query entered by the user.
arXiv Detail & Related papers (2025-04-01T09:55:23Z)
- XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [59.549015333755186]
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. Existing evaluations lack fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive benchmark for evaluating multilingual instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-03-10T17:07:52Z)
- On Many-Shot In-Context Learning for Long-Context Evaluation [10.500629810624769]
This paper delves into long-context language model evaluation through many-shot ICL. We develop metrics to categorize ICL tasks into two groups: similar-sample learning (SSL) and all-sample learning (ASL). We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
arXiv Detail & Related papers (2024-11-11T17:00:59Z)
- ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding [15.93642619347214]
We introduce ProverbEval, an LLM evaluation benchmark for low-resource languages. Native-language proverb descriptions significantly improve tasks such as proverb generation. Monolingual evaluations consistently outperformed their cross-lingual counterparts in generation tasks.
arXiv Detail & Related papers (2024-11-07T06:34:48Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset [7.954348293179786]
We propose CFLUE, a benchmark to assess the capability of large language models (LLMs) across various dimensions.
In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations.
In application assessment, it features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation.
arXiv Detail & Related papers (2024-05-17T05:03:40Z)
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
We introduce NovelQA, a benchmark tailored for evaluating Large Language Models (LLMs) with complex, extended narratives. NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding. Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses.
arXiv Detail & Related papers (2024-03-18T17:32:32Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved strong proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.