Fugu-MT Paper Translation (Abstract): Comparison of Scoring Rationales Between Large Language Models and Human Raters

Paper summary: Comparison of Scoring Rationales Between Large Language Models and Human Raters

arxiv url: http://arxiv.org/abs/2509.23412v1
Date: Sat, 27 Sep 2025 16:58:51 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.213291
Title: Comparison of Scoring Rationales Between Large Language Models and Human Raters
Authors: Haowei Hua, Hong Jiao, Dan Song
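Abstract summary: This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined. Cosine similarity is used to evaluate the similarity of the rationales provided.
Reference score (independently computed attention metric): 3.4283859937936705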
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and "thinking" of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.
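
The abstract names three agreement and similarity measures without defining them. The following is a minimal sketch of how they are commonly computed, not the authors' implementation. The function names, the integer score range, and the arithmetic-mean normalization for mutual information are assumptions made for illustration; the paper may use different conventions. Principal component analysis is omitted because it is usually done with a linear-algebra library.

```python
import math
from collections import Counter


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights for two raters.

    a, b: equal-length lists of integer scores in [min_score, max_score].
    """
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two score categories")
    total = len(a)
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / total  # chance agreement
    # den == 0 only when both raters use one identical category throughout.
    return 1.0 - num / den if den > 0 else 1.0


def normalized_mutual_information(a, b):
    """Mutual information divided by the mean of the two entropies.

    The arithmetic-mean normalization is an assumption; other variants exist.
    """
    total = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        c / total * math.log(c * total / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )
    ha = -sum(c / total * math.log(c / total) for c in pa.values())
    hb = -sum(c / total * math.log(c / total) for c in pb.values())
    denom = (ha + hb) / 2
    return mi / denom if denom > 0 else 1.0


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```

In this sketch, human and LLM scores for the same essays would be passed to the first two functions, and the embedding vectors of two rationales to the third. How the rationales are embedded is not stated in the abstract and is left out here.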
