Fugu-MT Paper Translation (Abstract): HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Paper overview: HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

arxiv url: http://arxiv.org/abs/2604.09408v1
Date: Fri, 10 Apr 2026 15:21:44 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.924792
Title: HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
Title (reference translation): HiL-Bench (Human-in-Loop Benchmark): Do agents know when they should ask for help?
Authors: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
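Abstract summary: Coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. Current benchmarks are blind to this failure mode. We propose HiL-Bench to measure this selective escalation skill.
Reference score (independently computed attention score): 32.54022440678003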
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
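Abstract (reference translation): Frontier coding agents solve complex tasks when given complete context, but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous, detailed instructions and reward only execution correctness, so an agent that makes a lucky guess about a missing requirement scores the same as one that would have asked in order to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not through upfront inspection. Our core metric, Ask-F1, is the harmonic mean of question precision and blocker recall; it captures the tension between over-asking and silent guessing, and its structure prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large, universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when it must decide whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; and broad, imprecise escalation without self-correction. These consistent patterns confirm that poor help-seeking is a model-level flaw, not a task-specific one. RL training on a shaped Ask-F1 reward shows that judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.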
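The abstract defines Ask-F1 only as the harmonic mean of question precision and blocker recall. The Python sketch below illustrates that formula under assumed definitions that are not taken from the paper: question precision is taken to be the share of asked questions that target a real blocker, and blocker recall the share of blockers covered by at least one question. The function name `ask_f1` and its parameters are hypothetical.

```python
def ask_f1(num_questions: int, num_relevant_questions: int,
           num_blockers: int, num_blockers_covered: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definitions (not confirmed by the abstract):
      precision = relevant questions / all questions asked
      recall    = blockers covered by a question / all blockers
    """
    # An agent that asks nothing has no precision to speak of; treat it as 0.
    precision = num_relevant_questions / num_questions if num_questions else 0.0
    recall = num_blockers_covered / num_blockers if num_blockers else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Three hypothetical agents on a task with 3 blockers.
silent_guesser = ask_f1(0, 0, 3, 0)     # never asks
question_spammer = ask_f1(30, 3, 3, 3)  # covers every blocker, but with 30 questions
selective_asker = ask_f1(3, 3, 3, 3)    # one targeted question per blocker

print(f"silent guesser:   {silent_guesser:.2f}")
print(f"question spammer: {question_spammer:.2f}")
print(f"selective asker:  {selective_asker:.2f}")
```

Under these assumptions the silent guesser scores 0, the spammer reaches full recall but is held to roughly 0.18 by its 10% precision, and only the selective asker scores 1.0. This is the sense in which a harmonic mean penalizes both silent guessing and question spam.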


Related papers