Fugu-MT Paper Translation (Abstract): CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

Paper overview: CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

arxiv url: http://arxiv.org/abs/2511.02360v1
Date: Tue, 04 Nov 2025 08:28:46 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.853696
Title: CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
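Title (reference translation): CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Authors: Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
Abstract summary: CoCoVa is a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. At its core is an iterative reasoning cycle in which a novel Latent Q-Former acts as a dynamic reasoning engine. The model is trained with a multi-task objective that combines contrastive learning and diffusion-based reconstruction.
Reference score (independently computed attention score): 22.835301879575002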
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
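Abstract (reference translation): Human cognition involves many thought processes that are tacit and go beyond verbal expression, allowing us to understand and interact with the world in a variety of ways. However, contemporary vision-language models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, which bottlenecks the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle in which a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To keep these latent thoughts grounded, the model is trained with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between the latent representations and both the visual and textual modalities. In evaluations, CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis shows that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.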
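
The abstract describes three ingredients: selecting salient visual tokens, iteratively refining a chain of latent thought vectors through cross-modal fusion, and a contrastive term that keeps those latents aligned with visual and textual features. The following is a minimal NumPy sketch of that general idea, not the paper's LQ-Former. Everything in it is an assumption made for illustration: the function names (select_salient_tokens, refine_thought, latent_chain, info_nce), the dot-product top-k selection rule, the single weight-free cross-attention step, and the InfoNCE form of the contrastive term. The diffusion-based reconstruction term is omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def select_salient_tokens(thought, visual_tokens, k):
    # Hypothetical selection rule: keep the k visual tokens whose
    # dot product with the current thought vector is largest.
    scores = visual_tokens @ thought
    top = np.argsort(scores)[-k:]
    return visual_tokens[top]

def refine_thought(thought, tokens):
    # One cross-attention step without learned weights (an assumption):
    # the thought queries the selected tokens and adds the result residually.
    d = thought.shape[0]
    weights = softmax(tokens @ thought / np.sqrt(d))
    return thought + weights @ tokens

def latent_chain(query_vec, visual_tokens, steps=3, k=4):
    # Builds a chain of continuous thought vectors, one per reasoning step.
    thoughts, thought = [], query_vec
    for _ in range(steps):
        selected = select_salient_tokens(thought, visual_tokens, k)
        thought = refine_thought(thought, selected)
        thoughts.append(thought)
    return np.stack(thoughts)  # shape: (steps, d)

def info_nce(latents, targets, temperature=0.1):
    # InfoNCE contrastive loss: row i of latents should match row i of targets.
    a = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    b = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(16, 8))  # 16 stand-in patch embeddings
    query_vec = rng.normal(size=8)            # stand-in for the encoded question
    chain = latent_chain(query_vec, visual_tokens, steps=3, k=4)
    print(chain.shape)
```

In a trained system the attention step would use learned projections and the loss would be computed against real image and text embeddings; here random vectors stand in for both, so the sketch only shows the data flow.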

Related papers