Fugu-MT Paper Translation (Abstract): Generative Universal Verifier as Multimodal Meta-Reasoner

Paper overview: Generative Universal Verifier as Multimodal Meta-Reasoner

arxiv url: http://arxiv.org/abs/2510.13804v1
Date: Wed, 15 Oct 2025 17:59:24 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.805742
Title: Generative Universal Verifier as Multimodal Meta-Reasoner
Title (reference translation): Generative Universal Verifier as a Multimodal Meta-Reasoner
Authors: Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
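Abstract summary: Generative Universal Verifier is a new concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models. ViVerBench is a benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. OmniVerifier-7B is the first omni-capable generative verifier trained for universal visual verification.
Reference score (independently calculated attention level): 71.34250480838473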
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
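Abstract (reference translation): This paper introduces Generative Universal Verifier, a new concept and plugin for next-generation multimodal reasoning in vision-language models and unified multimodal models. (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks, to evaluate visual outcomes in multimodal reasoning. The results show that existing VLMs consistently underperform on these tasks, leaving a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) OmniVerifier-TTS is a sequential test-time scaling paradigm that uses the universal verifier for image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, the universal verifier is extended to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3). By providing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.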
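
Illustrative sketch (not from the paper): the abstract contrasts sequential test-time scaling, where a verifier's feedback drives repeated edits of a single output, with parallel methods such as Best-of-N, where independent samples are scored and the best is kept. The toy Python below shows only that control-flow difference. Every name in it (sequential_refine, best_of_n, the toy_* helpers) is a hypothetical stand-in, and "images" are modeled as sets of object labels; it does not reproduce OmniVerifier-TTS or any released API.

```python
from typing import Callable, FrozenSet, Set, Tuple

# Hypothetical types for illustration only: a "prompt" is the set of objects
# that must appear, and an "image" is the set of objects actually present.
Prompt = FrozenSet[str]
Image = Set[str]
Verdict = Tuple[bool, str]  # (accepted, feedback naming one thing to fix)


def sequential_refine(
    prompt: Prompt,
    generate: Callable[[Prompt], Image],
    verify: Callable[[Prompt, Image], Verdict],
    edit: Callable[[Image, str], Image],
    max_steps: int = 4,
) -> Tuple[Image, int]:
    """Generate once, then alternate verify -> edit until accepted.

    Returns the final image and the number of edits applied. If the edit
    budget runs out, the last edited image is returned without a final check.
    """
    image = generate(prompt)
    for step in range(max_steps):
        accepted, feedback = verify(prompt, image)
        if accepted:
            return image, step
        image = edit(image, feedback)
    return image, max_steps


def best_of_n(
    prompt: Prompt,
    generate: Callable[[Prompt], Image],
    score: Callable[[Prompt, Image], float],
    n: int = 4,
) -> Image:
    """Parallel baseline: draw n independent samples and keep the top scorer."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))


# Toy stand-ins for a generator, a verifier, and an editor.
def toy_generate(prompt: Prompt) -> Image:
    return {"cat"}  # always omits everything except the cat


def toy_verify(prompt: Prompt, image: Image) -> Verdict:
    missing = sorted(prompt - image)
    return (not missing, missing[0] if missing else "")


def toy_edit(image: Image, feedback: str) -> Image:
    return image | {feedback}  # add the object the verifier asked for


def toy_score(prompt: Prompt, image: Image) -> float:
    return float(len(prompt & image))


required: Prompt = frozenset({"cat", "hat", "ball"})
refined, edits = sequential_refine(required, toy_generate, toy_verify, toy_edit)
sampled = best_of_n(required, toy_generate, toy_score, n=3)
```

In this toy setting the sequential loop repairs the output one verifier-reported defect at a time, whereas Best-of-N can only return whatever the generator happened to sample. That is the intuition behind the comparison in the abstract, under the simplifying assumptions stated above.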

Related papers