Fugu-MT Paper Translation (Summary): Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Paper summary: Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

arxiv url: http://arxiv.org/abs/2508.12466v1
Date: Sun, 17 Aug 2025 18:36:04 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.783537
Title: Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
Authors: Xuhui Zhan, Tyler Derr
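Abstract summary: Inverse-LLaVA is a novel approach to bridging the vision and language modalities. Rather than projecting visual features into text space, it maps text embeddings into a continuous visual representation space. The work establishes the feasibility of a new paradigm that reduces computational requirements by 45%.
Reference score (attention metric computed independently by this site): 10.994141504313689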
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.

Related papers