Fugu-MT Paper Translation (Abstract): Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Paper overview: Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

arxiv url: http://arxiv.org/abs/2603.18765v1
Date: Thu, 19 Mar 2026 11:20:07 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.109466
Title: Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks
Title (reference translation): Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment in Math, Programming, and Essay Tasks
Authors: Rudra Jadhav, Janhavi Danve, Sonalika Shaw
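Abstract summary: This study examines whether large language models (LLMs) exhibit implicit grading bias based on writing style when content correctness is held constant. Two state-of-the-art open-source LLMs were instructed to grade responses on a 1-10 scale, evaluating content correctness only and disregarding writing style. The results reveal statistically significant grading bias in Essay/Writing tasks.
Reference score (independently computed attention measure): 0.0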
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
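Abstract (reference translation): As large language models (LLMs) become increasingly common as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study examines whether LLMs exhibit implicit grading bias based on writing style when content correctness is held constant. We built a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types (grammar errors, informal language, and non-native phrasing). Two state-of-the-art open-source LLMs, LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba), were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. The results show statistically significant grading bias in Essay/Writing tasks for both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings indicate that LLM grading bias is subject-dependent and style-sensitive, and that it persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for the equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.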
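
The abstract reports bias as an average point deduction and as an effect size (Cohen's d). The sketch below shows one common way such figures can be computed. It is an illustration only: the abstract does not say which variant of Cohen's d the authors used, so the pooled-standard-deviation form is an assumption, and the function names and score lists are hypothetical, not data from the paper.

```python
import math
import statistics


def cohens_d(baseline, perturbed):
    """Cohen's d using a pooled sample standard deviation.

    Assumption: this is one common variant of d; the paper's exact
    formula is not given in the abstract.
    """
    n1, n2 = len(baseline), len(perturbed)
    var1 = statistics.variance(baseline)   # sample variance (n - 1 denominator)
    var2 = statistics.variance(perturbed)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(baseline) - statistics.mean(perturbed)) / pooled_sd


def mean_penalty(baseline, perturbed):
    """Average points lost when the same content is restyled.

    Scores are paired: item i in both lists grades the same response content.
    """
    return statistics.mean(b - p for b, p in zip(baseline, perturbed))


if __name__ == "__main__":
    # Hypothetical 1-10 grades for the same essays, original vs. informal style.
    baseline_scores = [9, 8, 9, 8, 9, 8, 9, 8, 9, 8]
    informal_scores = [7, 7, 7, 6, 7, 6, 7, 6, 7, 7]

    print("mean penalty:", mean_penalty(baseline_scores, informal_scores))
    print("Cohen's d:", round(cohens_d(baseline_scores, informal_scores), 2))
```

A significance test (for example a paired t-test over the same score pairs) would be needed alongside these figures to reproduce the kind of p < 0.05 claim made in the abstract.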


Related papers