Fugu-MT Paper Translation (Summary): Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance

Paper summary: Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance

arxiv url: http://arxiv.org/abs/2509.05669v1
Date: Sat, 06 Sep 2025 10:14:49 GMT
Status: Translation complete
System update date: 2025-09-09 14:07:03.643252
Title: Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance
Title (reference translation): Context-Aware Multi-Turn Visual Reasoning in LVLMs Using Dynamic Memory and Adaptive Visual Guidance
Authors: Weijie Shen, Xinrui Wang, Yuanqi Nie, Apiradee Boonmee,
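Abstract summary: Context-Aware Multi-Turn Visual Reasoning (CAMVR) is designed to give LVLMs robust and coherent multi-turn visual-textual reasoning capabilities. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both the current input and the accumulated historical context.
Reference score (independently computed attention score): 2.166625683790549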
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU's context to dynamically adjust the visual encoder's attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.
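Abstract (reference translation): Current large language models (LLMs) and vision-language large models (LVLMs) excel at single-turn tasks but face significant challenges in multi-turn interactions that require deep contextual understanding and complex visual reasoning, often resulting in fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a new framework intended to give LVLMs robust and coherent multi-turn visual-textual reasoning capabilities. CAMVR introduces two key innovations: the Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and the Adaptive Visual Focus Guidance (AVFG) mechanism, which uses the VCMU's context to dynamically steer the visual encoder's attention toward contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both the current input and the accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and the new MTIF dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.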
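
The abstract describes the VCMU as a read-write memory over turns and the AVFG as context-driven reweighting of image regions, but gives no architectural detail. The following is only a toy sketch of that general idea, not the paper's implementation: the names `ContextMemory`, `focus_weights`, and `capacity`, the dot-product scoring, and the oldest-first eviction rule are all assumptions made for illustration.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ContextMemory:
    """Hypothetical per-turn memory: one vector slot per dialogue turn."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = []

    def write(self, vector):
        # Store the new turn; evict the oldest slot when full (assumed policy).
        self.slots.append(list(vector))
        if len(self.slots) > self.capacity:
            self.slots.pop(0)

    def read(self, query):
        # Attention-style read: softmax-weighted average of stored slots.
        # With an empty memory, fall back to the query itself (assumption).
        if not self.slots:
            return list(query)
        weights = softmax([dot(query, slot) for slot in self.slots])
        dim = len(self.slots[0])
        return [sum(w * slot[i] for w, slot in zip(weights, self.slots))
                for i in range(dim)]


def focus_weights(context, regions):
    # Weight each image-region feature by its similarity to the context.
    return softmax([dot(context, region) for region in regions])


memory = ContextMemory(capacity=2)
memory.write([1.0, 0.0])   # features from turn 1
memory.write([0.0, 1.0])   # features from turn 2
context = memory.read([0.0, 5.0])  # current query resembles turn 2
weights = focus_weights(context, [[1.0, 0.0], [0.0, 1.0]])
```

In this sketch the region whose features resemble the recalled context receives the larger weight, which mirrors the abstract's notion of steering visual attention with accumulated dialogue history.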

Related papers list