Fugu-MT Paper Translation (Abstract): AcademicEval: Live Long-Context LLM Benchmark

Paper overview: AcademicEval: Live Long-Context LLM Benchmark

arxiv url: http://arxiv.org/abs/2510.17725v1
Date: Mon, 20 Oct 2025 16:42:30 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.532884
Title: AcademicEval: Live Long-Context LLM Benchmark
Title (reference translation): AcademicEval: Live Long-Context LLM Benchmark
Authors: Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You
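Abstract summary: \textsc{AcademicEval} is a benchmark for evaluating large language models (LLMs) on long-context generation tasks. \textsc{AcademicEval} uses papers on arXiv to introduce several academic writing tasks with long-context inputs. We conduct a holistic evaluation on \textsc{AcademicEval} and show that LLMs perform poorly on tasks with hierarchical abstraction levels.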
Reference score (independently computed attention score): 27.016001804846905
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
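Abstract (reference translation): Large language models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context lengths, labor-intensive annotation, and the pressing challenge of label leakage during LLM training. We therefore propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs on long-context generation tasks. \textsc{AcademicEval} uses papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}. In addition, \textsc{AcademicEval} integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. In particular, \textsc{AcademicEval} features an efficient live evaluation that ensures no label leakage. The results show that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of the benchmark. Through experimental analysis, we also reveal insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval.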
