Fugu-MT Paper Translation (Abstract): Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting

Paper abstract: Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting

arxiv url: http://arxiv.org/abs/2509.20982v1
Date: Thu, 25 Sep 2025 10:26:23 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:12.840666
Title: Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting
Title (reference translation): Analysis of instruction-based LLMs for scoring and judging text-input problems in a learning environment
Authors: Valeria Ramirez-Garcia, David de-Fitero-Dominguez, Antonio Garcia-Cabot, Eva Garcia-Lopez
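Abstract summary: Large language models (LLMs) can act as evaluators, a role studied through methods such as LLM-as-a-Judge and fine-tuned judging LLMs. This paper proposes five evaluation systems, tested with three models on a custom dataset of 110 computer science answers from higher education students. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations.
Reference score (independently computed attention score): 0.7699714865575188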
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: the JudgeLM evaluation, which uses the model's single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide in addition to the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions, and JudgeLM Evaluations have not provided good results due to the model's limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
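Abstract (reference translation): Large language models (LLMs) can serve as evaluators, a role studied through methods such as LLM-as-a-Judge and fine-tuned judging LLMs. In education, LLMs have been studied as assistant tools for students and teachers. This study examines LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. It proposes five evaluation systems, tested with three models on a custom dataset of 110 computer science answers from higher education students. The evaluation systems are: the JudgeLM evaluation, which obtains a score using the model's single answer prompt; Reference Aided Evaluation, which uses a correct answer as a guide in addition to the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is carried out with generated criteria fitted to each question. All evaluation methods were compared with the results of a human evaluator. The results show that Reference Aided Evaluation is the best method for automatically evaluating and scoring Text-Input Problems with LLMs. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214), Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive Evaluation and Adaptive Evaluation fail to give good results on concise answers, No Reference Evaluation lacks the information needed to assess questions correctly, and the JudgeLM evaluation did not give good results because of the model's limitations. The study concludes that AI-driven automatic evaluation systems, when aided by proper methodologies, show potential to work as complementary tools to other academic resources.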
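
The two agreement figures quoted in the abstract, median absolute deviation and root mean square deviation against the human evaluator, can be illustrated with a minimal sketch. This page does not give the paper's exact formulas or score scale, so the definitions below (median of the absolute per-answer differences, and the square root of the mean squared difference) are assumptions, and the function names and scores are made up for illustration.

```python
import math
import statistics


def median_absolute_deviation(model_scores, human_scores):
    """Median of |model - human| over paired scores (assumed definition)."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    diffs = [abs(m - h) for m, h in zip(model_scores, human_scores)]
    return statistics.median(diffs)


def root_mean_square_deviation(model_scores, human_scores):
    """Square root of the mean squared difference between paired scores."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    squared = [(m - h) ** 2 for m, h in zip(model_scores, human_scores)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical scores for five answers; not data from the paper.
human = [8.0, 5.0, 10.0, 3.0, 7.0]
model = [7.0, 6.5, 10.0, 1.0, 7.5]

mad = median_absolute_deviation(model, human)
rmsd = root_mean_square_deviation(model, human)
print(f"median absolute deviation: {mad:.3f}")
print(f"root mean square deviation: {rmsd:.3f}")
```

Under these definitions, lower values on both metrics mean the automatic scores sit closer to the human evaluator's. The root mean square deviation penalizes occasional large disagreements more heavily than the median does, which is one reason to report both.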

Related papers