Fugu-MT Paper Translation (Abstract): OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

Paper overview: OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

arxiv url: http://arxiv.org/abs/2605.16962v1
Date: Sat, 16 May 2026 12:26:04 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.391933
Title: OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics
Title (reference translation): OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics
Authors: Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong
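Abstract summary: The paper proposes OmniVL-Guard Pro, a tool-augmented agent. To generate high-quality tool-reasoning trajectories, it introduces Tree-Structured Self-Evolving Tool Trajectory Generation. It also proposes Checker-Guided Agentic Reinforcement Learning, which provides process-level supervision for cases where the answer is correct but the reasoning is distorted.
Reference score (independently computed attention score): 63.13200245209719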
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.
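Abstract (reference translation): Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming that verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, which creates a practical ceiling in dynamic open-world forensics, particularly for real-time event verification that requires external clues and for forgery segmentation that requires fine-grained scrutiny of local manipulations. To address these limitations, the authors shift from scaling up the self-contained model toward reaching beyond it. They propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world, clue-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, they introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, and which yields the Full-Spectrum Tool Reasoning (FSTR) dataset for training. They further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments show that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks and exhibits strong zero-shot generalization. The FSTR dataset and the code for OmniVL-Guard Pro will be released at https://github.com/shen8424/OmniVL-Guard-Pro.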
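The abstract states only that CGARL penalizes trajectories whose answer is correct but whose reasoning is distorted; it does not give the reward formula. As a rough illustration of that idea, and not the authors' method, the sketch below assumes a binary checker verdict and a fixed penalty. The names `Rollout`, `checker_guided_reward`, and `penalty` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One agent trajectory, reduced to two hypothetical binary signals."""
    answer_correct: bool   # final prediction matches the ground-truth label
    reasoning_sound: bool  # assumed verdict of a checker on the tool-reasoning trace


def checker_guided_reward(rollout: Rollout, penalty: float = 0.5) -> float:
    """Outcome reward with a process-level penalty (illustrative values only)."""
    if not rollout.answer_correct:
        # Wrong answers earn nothing, regardless of the checker verdict.
        return 0.0
    # Correct answer: full reward only if the checker accepts the reasoning.
    return 1.0 if rollout.reasoning_sound else 1.0 - penalty


if __name__ == "__main__":
    for r in (Rollout(True, True), Rollout(True, False), Rollout(False, True)):
        print(r, checker_guided_reward(r))
```

Under these assumptions, a correct answer reached through reasoning the checker rejects scores lower than one reached through accepted reasoning, so a policy trained on this signal is discouraged from producing right answers by distorted tool use.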
