Fugu-MT Paper Translation (Abstract): A Multi-Task Evaluation of LLMs' Processing of Academic Text Input

Paper overview: A Multi-Task Evaluation of LLMs' Processing of Academic Text Input

arxiv url: http://arxiv.org/abs/2508.11779v1
Date: Fri, 15 Aug 2025 19:05:57 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.370218
Title: A Multi-Task Evaluation of LLMs' Processing of Academic Text Input
Title (reference translation): A Multi-Task Evaluation of LLM Processing of Academic Text Input
Authors: Tianyi Li, Yu Qin, Olivia R. Liu Sheng
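Abstract summary: How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is hotly debated. We organize individual tasks that computer science studies employ under separate terms into a guided and robust workflow for evaluating LLMs' processing of academic text input. We employ four tasks (content reproduction, comparison, scoring, and reflection), each demanding a specific role of the LLM.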
Reference score (independently computed attention score): 6.654906601143054
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is in heated debate. Between a literature digest and a human-comparable research assistant lies their practical application potential. We organize individual tasks that computer science studies employ in separate terms into a guided and robust workflow to evaluate LLMs' processing of academic text input. We employ four tasks in the assessment: content reproduction/comparison/scoring/reflection, each demanding a specific role of the LLM (oracle/judgmental arbiter/knowledgeable arbiter/collaborator) in assisting scholarly works, and altogether testing LLMs with questions that increasingly require intellectual capabilities towards a solid understanding of scientific texts to yield desirable solutions. We exemplify a rigorous performance evaluation with detailed instructions on the prompts. Adopting first-rate Information Systems articles at three top journals as the input texts and an abundant set of text metrics, we record a compromised performance of the leading LLM - Google's Gemini: its summary and paraphrase of academic text is acceptably reliable; using it to rank texts through pairwise text comparison is faintly scalable; asking it to grade academic texts is prone to poor discrimination; its qualitative reflection on the text is self-consistent yet hardly insightful to inspire meaningful research. This evidence against an endorsement of LLMs' text-processing capabilities is consistent across metric-based internal (linguistic assessment), external (comparing to the ground truth), and human evaluation, and is robust to the variations of the prompt. Overall, we do not recommend an unchecked use of LLMs in constructing peer reviews.
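Abstract (reference translation): How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is hotly debated. Their practical application potential lies somewhere between a literature digest and a human-comparable research assistant. We organize individual tasks that computer science studies employ under separate terms into a guided and robust workflow for evaluating LLMs' processing of academic text input. We employ four tasks (content reproduction, comparison, scoring, and reflection), each demanding a specific role of the LLM (oracle/judgmental arbiter/knowledgeable arbiter/collaborator) in assisting scholarly work. We demonstrate a rigorous performance evaluation with detailed instructions on the prompts. Adopting first-rate Information Systems articles from three top journals as the input texts, together with an abundant set of text metrics, we record a compromised performance of the leading LLM, Google's Gemini: its summary and paraphrase of academic text is acceptably reliable; using it to rank texts through pairwise text comparison is faintly scalable; asking it to grade academic texts is prone to poor discrimination; its qualitative reflection on the text is self-consistent yet hardly insightful enough to inspire meaningful research. This evidence against endorsing LLMs' text-processing capabilities is consistent across metric-based internal evaluation (linguistic assessment), external evaluation (comparison with the ground truth), and human evaluation, and is robust to variations of the prompt. Overall, we do not recommend unchecked use of LLMs in constructing peer reviews.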


Related papers