Fugu-MT Paper Translation (Abstract): ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Paper overview: ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

arxiv url: http://arxiv.org/abs/2508.15164v1
Date: Thu, 21 Aug 2025 02:09:02 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.143434
Title: ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Title (reference translation): ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Authors: Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon
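Abstract summary: MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark) is a new dataset of 300 complex multi-turn dialogue scenarios. The paper also proposes CoLVLM Agent (Contextual LVLM Agent), a comprehensive framework that extends existing LVLMs with advanced reasoning and instruction-following capabilities. In experiments on MMDR-Bench, CoLVLM Agent performs strongly, reaching an average human evaluation score of 4.03.
Reference score (attention score computed independently by this site): 0.2999888908665658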
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.
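Abstract (reference translation): Despite major advances in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still struggle with complex, multi-turn, visually grounded tasks that require deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fail to capture the dynamism and complexity of real-world multi-modal interactions, which leads to problems such as context loss and visual hallucinations. To address these limitations, the authors introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a new dataset of 300 complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions, including visual entity tracking and reasoning depth. They also propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that extends existing LVLMs with advanced reasoning and instruction-following capabilities through an iterative "memory-perception-planning-execution" cycle, without requiring extensive retraining of the underlying models. Extensive experiments on MMDR-Bench show that CoLVLM Agent consistently achieves superior performance, with an average human evaluation score of 4.03 that surpasses state-of-the-art commercial models such as GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework shows significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, which validates the effectiveness of its modular design and iterative approach for complex multi-modal interactions.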


Related papers