Fugu-MT paper translation (abstract): Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

Paper abstract: Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

arxiv url: http://arxiv.org/abs/2510.22389v1
Date: Sat, 25 Oct 2025 18:12:41 GMT
Status: translation complete
Last updated in system: 2025-10-28 17:41:21.95878
Title: Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
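Title (reference translation): Can small and reasoning large language models score journal articles for research quality, and do averaging and few-shot help?
Authors: Mike Thelwall, Ehsan Mohammadi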
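Abstract summary: It is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash.
Reference score (independently computed attention score): 3.920564895363768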
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.
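Abstract (reference translation): Assessing published academic journal articles is a common task in evaluations of departments and individuals. Although it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and for the medium-sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one of them novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help, but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.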
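The score-averaging strategy described in the abstract (submitting the same query several times and averaging the returned scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `average_score`, `make_noisy_scorer` and `mean_abs_error` are hypothetical names, the noisy scorer is a stand-in for a real LLM call, and the score values and noise level are made up for the demonstration.

```python
import random
import statistics
from typing import Callable, Dict


def average_score(query: Callable[[str], float], article: str, repeats: int = 5) -> float:
    """Send the same article to `query` `repeats` times and return the mean score."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return statistics.fmean([query(article) for _ in range(repeats)])


def make_noisy_scorer(true_scores: Dict[str, float], noise: float, seed: int) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM: the 'true' score plus Gaussian noise."""
    rng = random.Random(seed)

    def scorer(article: str) -> float:
        return true_scores[article] + rng.gauss(0.0, noise)

    return scorer


def mean_abs_error(truth: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Mean absolute difference between predicted and reference scores."""
    return statistics.fmean([abs(predicted[a] - truth[a]) for a in truth])


if __name__ == "__main__":
    # Made-up reference scores for 200 placeholder articles (assumption).
    truth = {f"article-{i}": float(i % 4 + 1) for i in range(200)}
    scorer = make_noisy_scorer(truth, noise=1.0, seed=0)

    single = {a: scorer(a) for a in truth}
    averaged = {a: average_score(scorer, a, repeats=30) for a in truth}

    print("single-query error:", round(mean_abs_error(truth, single), 3))
    print("averaged error:    ", round(mean_abs_error(truth, averaged), 3))
```

In this toy setting the averaged scores sit closer to the reference values because independent noise partly cancels out. Whether the same holds for a real model depends on how its repeated outputs vary, which is the empirical question the paper examines.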

Related papers

Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
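SciIG is a task that evaluates the ability of LLMs to generate a coherent introduction from a title, abstract, and related work. Five state-of-the-art models are evaluated, covering open-source systems (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and the closed-source GPT-4o. The results show that LLaMA-4 Maverick performs best on most metrics, particularly semantic similarity and faithfulness.
Paper reference translation (metadata) (2025-08-19T21:11:11Z)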
An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
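Large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work is a systematic study of the M2MS ability of LLMs.
Paper reference translation (metadata) (2025-05-19T11:18:54Z)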
Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models [0.18846515534317265]
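General-purpose large language models (LLMs) often suffer from hallucination. This challenge highlights the need for systems that integrate domain-specific knowledge while maintaining the accuracy, relevance, and faithfulness of responses. This study uses a descriptive dataset of Quranic surahs, including the meanings, historical context, and qualities of the 114 surahs. Models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance.
Paper reference translation (metadata) (2025-03-20T13:26:30Z)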
Large Language Models as Misleading Assistants in Conversation [8.557086720583802]
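This paper investigates the ability of Large Language Models (LLMs) to be deceptive in the context of providing assistance on a reading comprehension task. We compare outcomes when (1) the model is prompted to provide truthful assistance, (2) the model is prompted to be subtly misleading, and (3) the model is prompted to argue for an incorrect answer.
Paper reference translation (metadata) (2024-07-16T14:45:22Z)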
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
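We propose evaluating models using a Panel of LLM evaluators (PoLL). A PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and is over seven times less expensive.
Paper reference translation (metadata) (2024-04-29T15:33:23Z)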
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation [32.0844209512788]
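It is a common belief that large language models (LLMs) are better than smaller ones. But what happens when both models operate under the same budget? We analyze code-generation LLMs of various sizes and compare running a 70B model with generating five outputs from a 13B model.
Paper reference translation (metadata) (2024-03-31T15:55:49Z)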
Can Large Language Models Automatically Score Proficiency of Written Essays? [3.993602109661159]
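Large language models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on a variety of tasks. We test their ability to analyze and effectively score essays, leveraging the strong linguistic knowledge of LLMs.
Paper reference translation (metadata) (2024-03-10T09:39:00Z)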
Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
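We show that distillation from GPT-3.5 (≥ 175B) down to T5 variants (≤ 11B) is possible. We propose model specialization, which specializes a model's ability toward a target task.
Paper reference translation (metadata) (2023-01-30T08:51:19Z)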

The related papers list is generated automatically from the titles and abstracts of papers on this site.

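This is the information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).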