Related papers: Multi-modal Auto-regressive Modeling via Visual Words

Multi-modal Auto-regressive Modeling via Visual Words

URL: http://arxiv.org/abs/2403.07720v2
Date: Mon, 23 Sep 2024 08:51:01 GMT
Title: Multi-modal Auto-regressive Modeling via Visual Words
Authors: Tianshuo Peng, Zuchao Li, Lefei Zhang, Hai Zhao, Ping Wang, Bo Du,
Abstract summary: We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary. We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
Score: 96.25078866446053
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated texts corpora, demonstrates powerful perceptual and reasoning capabilities. However, as for extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs), there lies a great difficulty that the image information is processed in the LMM as continuous visual embeddings, which cannot obtain discrete supervised labels for classification.In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time.Specifically, we propose the concept of visual tokens, which maps the visual features to probability distributions over LLM's vocabulary, providing supervision information for visual modelling.We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.

Related papers

Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space [52.34072027212278]
Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation.<n>Recent advances in large foundation models have substantially accelerated the development of embedding models.<n>We present the first systematic study of converting Multimodal dLLMs into embedding models.
arXiv Detail & Related papers (2026-01-19T06:51:15Z)
Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs)<n>This paper presents a comprehensive analysis of attention patterns in efficient VLMs.<n>We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z)
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives [36.297745473653166]
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language.<n>Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress.
arXiv Detail & Related papers (2025-05-20T13:47:40Z)
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis [21.869968563545736]
We define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input.<n>We introduce a scale-agnostic metric, textitattention accuracy, and a novel benchmark for quantifying IVMs.<n>We extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios.
arXiv Detail & Related papers (2025-05-15T17:52:40Z)
The Multi-Faceted Monosemanticity in Multimodal Representations [42.64636740703632]
We leverage recent advancements in feature monosemanticity to extract interpretable features from deep multimodal models. Our findings reveal that this categorization aligns closely with human cognitive understandings of different modalities. These results indicate that large-scale multimodal models, equipped with task-agnostic interpretability tools, offer valuable insights into key connections and distinctions between different modalities.
arXiv Detail & Related papers (2025-02-16T14:51:07Z)
Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens. We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
The Revolution of Multimodal Large Language Models: A Survey [46.84953515670248]
Multimodal Large Language Models (MLLMs) can seamlessly integrate visual and textual modalities. This paper provides a review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques.
arXiv Detail & Related papers (2024-02-19T19:01:01Z)
Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models. Our proposed model Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Our results indicate a improvement of 5-6 % in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.