Fugu-MT Paper Translation (Abstract): LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

Paper summary: LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

arxiv url: http://arxiv.org/abs/2510.16783v1
Date: Sun, 19 Oct 2025 10:15:42 GMT
Status: Translation complete
System update date: 2025-10-25 00:56:39.142992
Title: LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
Title (reference translation): LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
Authors: Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity
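Abstract summary: We present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic. The benchmark includes datasets in both Arabic and English for each task, enabling comparative analysis of performance across different text genres.
Reference score (independently computed attention score): 0.4837072536850575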
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
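Abstract (reference translation): Recent advances in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including processing and comprehending extended contexts. These emergent capabilities call for rigorous evaluation methods to assess performance in long-context understanding effectively. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, enabling comparative analysis of performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, and the results indicate that LC-Eval presents significant challenges. Even high-performing models such as GPT-4o struggled with certain tasks, underscoring the complexity and rigor of the benchmark.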

Related papers