ImageBind-LLM: Multi-modality Instruction Tuning
- URL: http://arxiv.org/abs/2309.03905v2
- Date: Mon, 11 Sep 2023 20:25:16 GMT
- Title: ImageBind-LLM: Multi-modality Instruction Tuning
- Authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao,
Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei
Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
- Abstract summary: ImageBind-LLM is a multi-modality instruction tuning method for large language models (LLMs) via ImageBind.
It can respond to audio, 3D point clouds, video, and their embedding-space arithmetic with only image-text alignment training.
- Score: 70.05191504511188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ImageBind-LLM, a multi-modality instruction tuning method for large
language models (LLMs) via ImageBind. Existing works mainly focus on language
and image instruction tuning; in contrast, our ImageBind-LLM can respond to
multi-modality conditions, including audio, 3D point clouds, video, and their
embedding-space arithmetic, with only image-text alignment training.
During training, we adopt a learnable bind network to align the embedding spaces
of LLaMA and ImageBind's image encoder. The image features transformed by the
bind network are then added to the word tokens at all layers of LLaMA,
progressively injecting visual instructions through an attention-free,
zero-initialized gating mechanism. Aided by the joint embedding of
ImageBind, the simple image-text training enables our model to exhibit superior
multi-modality instruction-following capabilities. During inference, the
multi-modality inputs are fed into the corresponding ImageBind encoders, and
processed by a proposed visual cache model for further cross-modal embedding
enhancement. The training-free cache model retrieves from three million image
features extracted by ImageBind, which effectively mitigates the
training-inference modality discrepancy. Notably, with our approach,
ImageBind-LLM can respond to instructions in diverse modalities and exhibit
strong language generation quality. Code is released at
https://github.com/OpenGVLab/LLaMA-Adapter.
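
The bind network and the zero-initialized gating described in the abstract can be pictured with a short PyTorch sketch. This is a minimal illustration under stated assumptions: the module layout, dimensions, and activation are placeholders, not the released implementation (see the linked repository for the actual code).

```python
# Minimal sketch (assumptions): a bind network that projects ImageBind image
# embeddings into the LLaMA hidden space, and an attention-free, zero-initialized
# gate that adds the projected feature to every word token of a decoder layer.
# Dimensions and module layout are illustrative, not the released implementation.
import torch
import torch.nn as nn


class BindNetwork(nn.Module):
    """Maps an ImageBind image embedding to the LLaMA hidden size."""

    def __init__(self, imagebind_dim: int = 1024, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(imagebind_dim, llama_dim),
            nn.SiLU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, image_embed: torch.Tensor) -> torch.Tensor:
        # image_embed: (batch, imagebind_dim) -> (batch, llama_dim)
        return self.proj(image_embed)


class ZeroGatedInjection(nn.Module):
    """Adds a visual feature to all word tokens through a zero-initialized gate."""

    def __init__(self):
        super().__init__()
        # The gate starts at zero, so training begins from the unmodified LLM.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, tokens: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, llama_dim); visual: (batch, llama_dim)
        return tokens + self.gate.tanh() * visual.unsqueeze(1)


# Toy usage: inject one image embedding into a dummy token sequence.
bind = BindNetwork()
inject = ZeroGatedInjection()
image_embed = torch.randn(2, 1024)          # stand-in for an ImageBind output
tokens = torch.randn(2, 16, 4096)           # stand-in for LLaMA hidden states
tokens = inject(tokens, bind(image_embed))  # equals the input at initialization
```

Because the gate is initialized to zero, the visual branch does not perturb the pretrained LLaMA outputs at the start of training, and its influence grows only as the gate is learned.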
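
The training-free visual cache model can likewise be approximated as nearest-neighbour retrieval over a bank of ImageBind image features: an embedding from a non-image modality is blended with the image features it retrieves, pulling it toward the image-only training distribution. The top-k value and mixing weight below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumptions): retrieval-based embedding enhancement. The cache
# holds normalized ImageBind *image* features; a query embedding from any other
# modality is blended with its top-k nearest image features. k and the mixing
# weight are illustrative choices only.
import torch
import torch.nn.functional as F


def enhance_with_cache(query: torch.Tensor,
                       cache: torch.Tensor,
                       k: int = 4,
                       mix: float = 0.5) -> torch.Tensor:
    """query: (batch, dim) embedding from any ImageBind encoder.
    cache: (num_images, dim) bank of ImageBind image features."""
    q = F.normalize(query, dim=-1)
    c = F.normalize(cache, dim=-1)
    sims = q @ c.t()                                   # cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=-1)         # nearest image features
    weights = topk_sims.softmax(dim=-1).unsqueeze(-1)  # similarity-weighted average
    retrieved = (weights * c[topk_idx]).sum(dim=1)
    return mix * q + (1.0 - mix) * retrieved           # blended, image-like embedding


# Toy usage with a small random cache standing in for the 3M-image feature bank.
cache = torch.randn(10_000, 1024)
audio_embed = torch.randn(2, 1024)  # stand-in for an ImageBind audio embedding
enhanced = enhance_with_cache(audio_embed, cache)
```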
Related papers
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short of comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple: MLLMs are used to produce multiple diverse extended captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and texts in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- The CLIP Model is Secretly an Image-to-Prompt Converter [26.92989288717742]
The paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts.
Such an image-to-prompt conversion can be achieved with a linear projection matrix computed in closed form (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-05-22T04:52:12Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
The regions relevant to each word are identified by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model (a sketch of this idea follows this list).
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
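
For the CLIP image-to-prompt converter entry above, the closed-form projection can be illustrated by fitting a linear map from CLIP image embeddings to prompt embeddings with a regularized least-squares (pseudo-inverse) solution. The paired data, dimensions, and ridge term are assumptions for illustration, not that paper's exact construction.

```python
# Minimal sketch (assumptions): a closed-form linear map W from CLIP image
# embeddings to prompt embeddings, solved by ridge-regularized least squares.
# The paired embeddings, dimensions, and ridge value are illustrative only.
import torch


def fit_image_to_prompt(image_embeds: torch.Tensor,
                        prompt_embeds: torch.Tensor,
                        ridge: float = 1e-3) -> torch.Tensor:
    """Solve W so that prompt_embeds is approximately image_embeds @ W (closed form)."""
    d = image_embeds.shape[1]
    gram = image_embeds.t() @ image_embeds + ridge * torch.eye(d)
    return torch.linalg.solve(gram, image_embeds.t() @ prompt_embeds)


# Toy usage with random stand-ins for paired CLIP image / prompt embeddings.
image_embeds = torch.randn(512, 768)
prompt_embeds = torch.randn(512, 768)
W = fit_image_to_prompt(image_embeds, prompt_embeds)
pseudo_prompt = torch.randn(1, 768) @ W  # an image embedding converted into a prompt
```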
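
For the ALBEF entry above, momentum distillation can be sketched as an exponential-moving-average (EMA) copy of the model producing soft pseudo-targets that regularize the contrastive image-text loss. The encoder, momentum coefficient, temperature, and mixing weight are illustrative stand-ins, not ALBEF's actual configuration.

```python
# Minimal sketch (assumptions): momentum distillation. An EMA "momentum" copy of
# the encoder produces soft targets; the student's contrastive loss mixes hard
# one-hot targets with the teacher's similarity distribution.
import copy

import torch
import torch.nn.functional as F


def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.995):
    """Exponential-moving-average update of the momentum (teacher) parameters."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(m).add_(p_s, alpha=1.0 - m)


def distilled_contrastive_loss(sim_student, sim_teacher, temperature=0.07, alpha=0.4):
    """Mix hard targets with the teacher's soft distribution over the batch."""
    targets = torch.arange(sim_student.size(0))  # matched pairs lie on the diagonal
    hard = F.cross_entropy(sim_student / temperature, targets)
    soft = F.kl_div(F.log_softmax(sim_student / temperature, dim=-1),
                    F.softmax(sim_teacher / temperature, dim=-1),
                    reduction="batchmean")
    return (1.0 - alpha) * hard + alpha * soft


# Toy usage: a linear projector as the "encoder", plus its EMA teacher.
student = torch.nn.Linear(32, 32)
teacher = copy.deepcopy(student)
img, txt = torch.randn(8, 32), torch.randn(8, 32)
sim_s = student(img) @ student(txt).t()
with torch.no_grad():
    sim_t = teacher(img) @ teacher(txt).t()
loss = distilled_contrastive_loss(sim_s, sim_t)
ema_update(student, teacher)  # the teacher drifts slowly toward the student
```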
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.