Fugu-MT 論文翻訳(概要): AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

論文の概要: AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

arxiv url: http://arxiv.org/abs/2604.11950v1
Date: Mon, 13 Apr 2026 18:44:02 GMT
ステータス: 翻訳完了
システム内更新日: 2026-04-15 19:11:32.065909
Title: AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
Title（参考訳）: AnyPoC: スケーラブルLLMベースバグ検出のための概念実証テスト生成
Authors: Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, Lingming Zhang,
Abstract要約: 我々は、AnyPoCという一般的なマルチエージェントフレームワークを紹介し、バグレポートを分析し、ファクトチェックする。 AnyPoCは真陽性のバグレポートに対して1.3倍有効なPoCを生成し、偽陽性のバグレポートの9.8倍を拒否する。これまでにAnyPoCは122の新たなバグ(確認105、すでに86)を発見し、45の生成されたPoCを公式回帰テストとして採用している。
参考スコア（独自算出の注目度）: 21.99631570872796
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.
Abstract（参考訳）: 最近のLSMベースのエージェントはソースコードの多くのバグを識別できるが、その報告は手動による検証を必要とする静的な仮説のままであり、自動バグ検出の実用性を制限する。候補レポート、実行可能な概念実証テスト、あるいは単にPoC(スクリプト、コマンドシーケンス、クラフトインプットなど)を合成して、疑わしい欠陥をトリガーします。自動PoC生成はスケーラブルな検証オラクルとして機能し、具体的な実行証拠を提供することで、エンドツーエンドの自律的なバグ検出を可能にする。しかし、ナイーブなLLMエージェントは信頼できないバリケーターであり、それらは「成功」に偏りがあり、可塑性だが非機能的なPoCや幻覚された痕跡を生産することで報酬を得られる。そこで本研究では,(1)対象のバグレポートを分析し,ファクトチェックを行う一般的なマルチエージェントフレームワークであるAnyPoCを提案する。(2) 実行トレースを収集しながらPoCを反復的に合成し,実行し,(3) 幻覚と報酬のハッキングを軽減するためにPoCを独立に再実行し,精査する。さらに、AnyPoCは、異種タスクを処理するためにPoC知識ベースを継続的に抽出し、進化させます。 AnyPoCは、ソースに関わらず、候補となるバグレポートで動作し、異なるバグレポーターとペアリングすることができる。実用性と汎用性を示すため、AnyPoCは、Firefox、Chromium、LLVM、OpenSSL、SQLite、FFmpeg、Redisなど、さまざまな言語/ドメイン(多くは数百万行のコード)にまたがる12のクリティカルなソフトウェアシステムに適用します。最先端のコーディングエージェントであるClaude CodeやCodexと比較すると、AnyPoCは真陽性のバグレポートに対して1.3倍有効なPoCを生成し、偽陽性のバグレポートを9.8倍拒否する。これまでにAnyPoCは122の新たなバグ(確認105、すでに86)を発見し、45の生成されたPoCを公式回帰テストとして採用している。

論文の概要: AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

関連論文リスト