Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
- URL: http://arxiv.org/abs/2508.12466v1
- Date: Sun, 17 Aug 2025 18:36:04 GMT
- Title: Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
- Authors: Xuhui Zhan, Tyler Derr
- Abstract summary: Inverse-LLaVA is a novel approach to bridge vision and language modalities. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%.
- Score: 10.994141504313689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.
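The abstract names two mechanisms: a projector that maps text embeddings into the vision encoder's continuous feature space (the inverse of the usual LLaVA-style projection), and selective additive components in the attention layers that fold visual features back into intermediate transformer layers. The sketch below is one plausible reading of that design in PyTorch; module names, dimensions, the zero-initialized gate, and the back-projection are illustrative assumptions, not the authors' released implementation (see the project website for the actual code).

```python
# Minimal sketch of an inverted text-to-vision mapping with gated additive
# fusion, assuming LLaMA-sized text states (4096-d) and ViT-sized visual
# features (1024-d). All design details here are assumptions for illustration.
import torch
import torch.nn as nn


class TextToVisionProjector(nn.Module):
    """Maps text hidden states (d_text) into the visual feature space (d_vision)."""

    def __init__(self, d_text: int = 4096, d_vision: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_text, d_vision),
            nn.GELU(),
            nn.Linear(d_vision, d_vision),
        )

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_text) -> (batch, seq_len, d_vision)
        return self.proj(text_hidden)


class AdditiveVisualAttention(nn.Module):
    """Adds a gated cross-attention term over visual features to a layer's
    text states, instead of concatenating projected image tokens into the
    text sequence (a hypothetical reading of "selective additive components")."""

    def __init__(self, d_text: int = 4096, d_vision: int = 1024, n_heads: int = 8):
        super().__init__()
        self.text_to_vision = TextToVisionProjector(d_text, d_vision)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_vision, num_heads=n_heads, batch_first=True
        )
        self.back_proj = nn.Linear(d_vision, d_text)
        # Gate initialized at zero so the fused layer starts as the unmodified
        # language model (assumption; the abstract does not specify the gating).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, t_len, d_text); visual_feats: (batch, v_len, d_vision)
        queries = self.text_to_vision(text_hidden)  # queries live in vision space
        fused, _ = self.cross_attn(queries, visual_feats, visual_feats)
        return text_hidden + torch.tanh(self.gate) * self.back_proj(fused)


if __name__ == "__main__":
    layer = AdditiveVisualAttention()
    text = torch.randn(2, 16, 4096)     # e.g. LLM hidden states at some layer
    vision = torch.randn(2, 576, 1024)  # e.g. ViT patch features
    print(layer(text, vision).shape)    # torch.Size([2, 16, 4096])
```

Because the additive term starts gated to zero, a layer of this form leaves the pretrained language model's behavior intact at initialization, which is consistent with the claim that a separate image-text alignment pre-training stage can be skipped.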
Related papers
- See, Think, Act: Online Shopper Behavior Simulation with VLM Agents [58.92444959954643]
This paper investigates the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs. We employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor.
arXiv Detail & Related papers (2025-10-22T05:07:14Z) - Infusing fine-grained visual knowledge to Vision-Language Models [5.487134463783365]
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs). We propose a fine-tuning method explicitly designed to achieve an optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning.
arXiv Detail & Related papers (2025-08-16T19:12:09Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs [3.090279286701713]
This paper introduces a simple yet effective intervention: augmenting visual inputs with low-level spatial structures. We empirically demonstrate substantial performance improvements across core visual reasoning tasks.
arXiv Detail & Related papers (2025-06-27T11:44:40Z) - Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z) - DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features. Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception. We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z) - D-Attn: Decomposed Attention for Large Vision-and-Language Models [29.611769371733672]
We propose Decomposed Attention (D-Attn), a more flexible attention architecture for large vision-and-language models (LVLMs). D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions. Experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks.
arXiv Detail & Related papers (2025-02-04T00:46:11Z) - Advancing Prompt Learning through an External Layer [24.77977865016954]
We propose a paradigm called EnPrompt with a novel External Layer (EnLa).
The learnable external layer is built upon valid embeddings of the pre-trained CLIP.
Four experiments demonstrate that our method outperforms existing prompt learning methods.
arXiv Detail & Related papers (2024-07-29T03:30:09Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)