Fugu-MT paper translation (abstract): Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Paper abstract: Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2603.18118v1
Date: Wed, 18 Mar 2026 15:28:07 GMT
Status: translation complete
System update date: 2026-03-20 17:19:05.781024
Title: Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models
Authors: Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu
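Abstract summary: The paper proposes a unified multi-agent visual reasoning framework that evolves from the foundational image-centric model Insight-V. It introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Experiments on base models such as LLaVA-NeXT and Qwen2.5-VL show significant performance gains across challenging image and video reasoning benchmarks.
Reference score (independently computed attention score): 65.4947731385794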
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
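The abstract contrasts off-policy DPO with GRPO-style training but does not specify how ST-GRPO or J-GRPO work. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with generic GRPO: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the example rewards, the `eps` term, and the choice of population standard deviation are all assumptions made for this sketch and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Generic GRPO-style step (an assumption for illustration), not the
    paper's ST-GRPO or J-GRPO, whose details the abstract does not give.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separately trained value model is needed for the baseline.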


Related papers