Fugu-MT Paper Translation (Abstract): Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

Paper summary: Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

arxiv url: http://arxiv.org/abs/2511.18597v1
Date: Sun, 23 Nov 2025 19:39:44 GMT
Status: Translation complete
Last updated in system: 2025-11-25 18:34:24.918719
Title: Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
Title (reference translation): Toward Trustworthy Difficulty Assessment: Large Language Models as Judges for Programming and Synthetic Tasks
Authors: H. M. Shadman Tabib, Jaber Ahmed Deedar
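Abstract summary: Large language models (LLMs) have demonstrated impressive capabilities in natural language and code generation. LLMs are increasingly deployed as automatic judges of model outputs and learning activities. The paper compares GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features.
Reference score (independently computed attention score): 0.0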
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.
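Abstract (reference translation): Large language models (LLMs) have shown impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. However, their behavior on structured tasks, such as predicting the difficulty of competitive programming problems, has not yet been well studied. We systematically compare GPT-4o, used as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM reaches 86% accuracy while GPT-4o reaches 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints such as input size limits and acceptance rates play an important role in separating Hard problems from easier ones. In contrast, GPT-4o often overlooks these cues and shows a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, which contradicts its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and on automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.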
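The abstract's confusion-matrix analysis, and its "bias toward simpler categories", can be illustrated with a small sketch. This is not the authors' code: the function names (`confusion_matrix`, `accuracy`, `downgrade_rate`) and the toy labels are assumptions made for illustration, and none of the numbers come from the paper. The sketch counts true-versus-predicted label pairs and then measures how often a judge assigns a strictly easier label than the true one.

```python
from collections import Counter

# Difficulty labels in increasing order of difficulty.
LABELS = ["Easy", "Medium", "Hard"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return a matrix whose rows are true labels and columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def downgrade_rate(y_true, y_pred, labels=LABELS):
    """Fraction of predictions that are strictly easier than the true label."""
    rank = {label: i for i, label in enumerate(labels)}
    lower = sum(rank[p] < rank[t] for t, p in zip(y_true, y_pred))
    return lower / len(y_true)


# Hypothetical toy data, NOT taken from the paper.
y_true = ["Easy", "Easy", "Medium", "Medium", "Hard", "Hard", "Hard", "Medium"]
y_pred = ["Easy", "Easy", "Easy", "Medium", "Easy", "Medium", "Hard", "Easy"]

matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, matrix):
    print(f"{label:>6}: {row}")
print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"downgrade rate = {downgrade_rate(y_true, y_pred):.3f}")
```

A judge biased toward simpler categories shows up as mass below the diagonal of the matrix (true Hard predicted as Medium or Easy), which the downgrade rate condenses into a single number.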
