Fugu-MT Paper Translation (Summary): Evidence-Bound Autonomous Research (EviBound): A Governance Framework for Eliminating False Claims

Paper summary: Evidence-Bound Autonomous Research (EviBound): A Governance Framework for Eliminating False Claims

arxiv url: http://arxiv.org/abs/2511.05524v1
Date: Tue, 28 Oct 2025 17:47:13 GMT
Status: Translation complete
System update date: 2025-11-16 06:38:31.016061
Title: Evidence-Bound Autonomous Research (EviBound): A Governance Framework for Eliminating False Claims
Authors: Ruiying Chen
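Abstract summary: EviBound is an evidence-bound execution framework that eliminates false claims through dual governance gates. The pre-execution Approval Gate validates acceptance criteria schemas before code runs. The post-execution Verification Gate validates artifacts via MLflow API queries.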
Reference score (independently computed attention score): 0.609170287691728
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based autonomous research agents report false claims: tasks marked "complete" despite missing artifacts, contradictory metrics, or failed executions. EviBound is an evidence-bound execution framework that eliminates false claims through dual governance gates requiring machine-checkable evidence. Two complementary gates enforce evidence requirements. The pre-execution Approval Gate validates acceptance criteria schemas before code runs, catching structural violations proactively. The post-execution Verification Gate validates artifacts via MLflow API queries (with recursive path checking) and optionally validates metrics when specified by acceptance criteria. Claims propagate only when backed by a queryable run ID, required artifacts, and FINISHED status. Bounded, confidence-gated retries (typically 1-2 attempts) recover from transient failures without unbounded loops. The framework was evaluated on 8 benchmark tasks spanning infrastructure validation, ML capabilities, and governance stress tests. Baseline A (Prompt-Level Only) yields 100% hallucination (8/8 claimed, 0/8 verified). Baseline B (Verification-Only) reduces hallucination to 25% (2/8 fail verification). EviBound (Dual Gates) achieves 0% hallucination: 7/8 tasks verified and 1 task correctly blocked at the approval gate, all with only approximately 8.3% execution overhead. This package includes execution trajectories, MLflow run IDs for all verified tasks, and a 4-step verification protocol. Research integrity is an architectural property, achieved through governance gates rather than emergent from model scale.
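The dual-gate flow described in the abstract can be illustrated with a minimal Python sketch. Everything named here (`AcceptanceCriteria`, `RunRecord`, `approval_gate`, `verification_gate`, `run_task`) is hypothetical and not taken from the paper. `RunRecord` is an in-memory stand-in for what a tracking-server query would return; with MLflow, that data could plausibly come from `MlflowClient.get_run` and `MlflowClient.list_artifacts`. The metric check uses a simple minimum threshold as an assumed form, and the retry loop is bounded but omits the confidence gating the abstract mentions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AcceptanceCriteria:
    """Hypothetical schema: the evidence a task must produce."""
    required_artifacts: list[str] = field(default_factory=list)
    min_metrics: dict[str, float] = field(default_factory=dict)  # optional


@dataclass
class RunRecord:
    """Stand-in for the result of a tracking-server query (assumption)."""
    run_id: str
    status: str
    artifacts: set[str]
    metrics: dict[str, float] = field(default_factory=dict)


def approval_gate(criteria: AcceptanceCriteria) -> bool:
    """Pre-execution: reject structurally invalid criteria before code runs."""
    if not criteria.required_artifacts:
        return False
    return all(isinstance(p, str) and bool(p.strip())
               for p in criteria.required_artifacts)


def verification_gate(run: Optional[RunRecord],
                      criteria: AcceptanceCriteria) -> bool:
    """Post-execution: a claim needs a run ID, FINISHED status, and artifacts."""
    if run is None or run.status != "FINISHED":
        return False
    if any(path not in run.artifacts for path in criteria.required_artifacts):
        return False
    # Metrics are checked only when the criteria specify them.
    return all(run.metrics.get(name, float("-inf")) >= floor
               for name, floor in criteria.min_metrics.items())


def run_task(criteria: AcceptanceCriteria,
             execute: Callable[[], Optional[RunRecord]],
             max_attempts: int = 2) -> str:
    """Run both gates around `execute`, retrying a bounded number of times."""
    if not approval_gate(criteria):
        return "BLOCKED"
    for _ in range(max_attempts):
        run = execute()
        if verification_gate(run, criteria):
            return f"VERIFIED:{run.run_id}"
    return "FAILED"
```

In this sketch a task is reported as verified only when the evidence check passes, so a "complete" claim with no matching run record cannot propagate; a task with empty acceptance criteria is blocked before any code runs.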
