Fugu-MT Paper Translation (Abstract): How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

Paper overview: How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

arxiv url: http://arxiv.org/abs/2604.06756v1
Date: Wed, 08 Apr 2026 07:21:18 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.393062
Title: How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Title (reference translation): How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Authors: Minzhu Tu, Shiyu Ni, Keping Bi
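Abstract summary: Large language models (LLMs) have been widely adopted as scalable surrogates for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. Weak judges are easily swayed by the presence of reasoning, while strong judges can partially leverage reasoning as informative evidence.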
Reference score (independently computed attention score): 9.19183567561999
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have been widely adopted as scalable surrogates for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.
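Abstract (reference translation): Large language models (LLMs) are widely adopted as scalable surrogates for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges do not have enough information to assess the correctness of an answer. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual effect on judge behavior remains understudied. This paper systematically examines how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. Weak judges are easily swayed by the presence of reasoning and frequently accept incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Even so, strong judges are also misled by seemingly high-quality reasoning chains. Controlled experiments show that both the fluency and the factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.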


Related papers