Fugu-MT paper translation (abstract): How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

Paper overview: How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

arxiv url: http://arxiv.org/abs/2604.00008v1
Date: Mon, 09 Mar 2026 15:22:41 GMT
Status: translation complete
Last updated in system: 2026-04-06 02:36:13.183181
Title: How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows
Title (reference translation): How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows
Authors: Songhee Han, Jueun Shin, Jiyoon Han, Bung-Woo Jun, Hilal Ayan Karabatman,
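Abstract summary: This study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality. One-sentence interpretive responses were generated using five widely adopted inference models. The results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude.
Reference score (independently computed attention score): 0.6437935154416734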
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluations were conducted using AWS Bedrock's LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. Results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among automated metrics, Coherence showed the strongest alignment with aggregated human ratings, whereas Faithfulness and Correctness revealed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These findings suggest that LLM-as-judge methods are better suited for screening or eliminating underperforming models than for replacing human judgment, offering practical guidance for systematic comparison and selection of LLMs in qualitative research workflows.
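Abstract (reference translation): As qualitative researchers grow more interested in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, we examine whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses with five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluation used AWS Bedrock's LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. The results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among the automated metrics, Coherence aligned most strongly with aggregated human ratings, whereas Faithfulness and Correctness showed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These results suggest that LLM-as-judge methods are better suited to screening out or eliminating underperforming models than to replacing human judgment, and they offer practical guidance for systematically comparing and selecting LLMs in qualitative research workflows.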
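
Illustrative note: the abstract separates two kinds of agreement, directional (do judge and human ratings order the models the same way?) and magnitude (are the scores themselves close?). The sketch below shows one way such a comparison could be set up. It is not the authors' analysis: the model names, all numbers, the 0-1 judge scale, the 1-5 human scale, and the choice of Spearman rank correlation and mean absolute gap are assumptions made for illustration only.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rescale(values, low, high):
    """Map values from the interval [low, high] onto [0, 1]."""
    return [(v - low) / (high - low) for v in values]


# Hypothetical placeholder numbers, NOT results from the paper.
# Assumed scales: judge scores in [0, 1], human ratings in [1, 5].
judge = {"model_a": 0.91, "model_b": 0.88, "model_c": 0.84,
         "model_d": 0.80, "model_e": 0.62}
human = {"model_a": 3.9, "model_b": 3.6, "model_c": 3.7,
         "model_d": 3.1, "model_e": 2.2}

models = sorted(judge)
judge_scores = [judge[m] for m in models]
human_scores = rescale([human[m] for m in models], 1, 5)

# Directional agreement: do both raters order the models similarly?
rho = spearman(judge_scores, human_scores)
# Magnitude agreement: how far apart are the scores on a common scale?
gap = mean(abs(j - h) for j, h in zip(judge_scores, human_scores))

print(f"model-level Spearman rho: {rho:.2f}")
print(f"mean absolute score gap:  {gap:.3f}")
```

With these placeholder values the rank correlation is 0.9 while the mean gap on the common 0-1 scale is about 0.235, the pattern the abstract describes: similar ordering of models, noticeably different score magnitudes. A high rank correlation alone would support using judge scores to screen out weak models, but not to stand in for the human ratings themselves.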
