Bridging Compressed Image Latents and Multimodal Large Language Models
- URL: http://arxiv.org/abs/2407.19651v2
- Date: Mon, 17 Feb 2025 15:29:40 GMT
- Title: Bridging Compressed Image Latents and Multimodal Large Language Models
- Authors: Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng, Yi-Hsin Chen, Alessandro Gnutti, Shao-Yuan Lo, Wen-Hsiao Peng, Riccardo Leonardi
- Abstract summary: This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks.
MLLMs have extended the success of large language models to modalities beyond text, but their billion-parameter scale hinders deployment on resource-constrained end devices.
We propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks.
- Score: 45.83457913639876
- License:
- Abstract: This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g., images) beyond text, but their billion-parameter scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. Given the huge scale of MLLMs, our framework excludes the entire downstream MLLM, except part of its visual encoder, from training our system. This stands in contrast to most existing coding-for-machines approaches, which involve downstream networks in training and thus could be impractical when those networks are MLLMs. The proposed framework is general in that it is applicable to various MLLMs, neural image codecs, and multiple application scenarios, where the neural image codec can be (1) pre-trained for human perception without updating, (2) fully updated for joint human and machine perception, or (3) fully updated for only machine perception. Extensive experiments on different neural image codecs and various MLLMs show that our method achieves strong rate-accuracy performance with much lower complexity.
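Below is a minimal PyTorch sketch of the setup described in the abstract: a lightweight transform-neck maps codec latents to visual tokens and is trained with a surrogate feature-matching loss against a frozen slice of the MLLM's visual encoder, so the full MLLM never enters the training loop. The module shapes, the two-layer convolutional neck, the MSE form of the surrogate loss, and the placeholder encoder are illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch (not the paper's implementation) of adapting compressed image
# latents for an MLLM: a lightweight transform-neck turns codec latents into
# visual tokens, trained with a surrogate feature-matching loss against a frozen
# partial visual encoder, so the billion-parameter MLLM stays out of training.
import torch
import torch.nn as nn

class TransformNeck(nn.Module):
    """Lightweight adapter: codec latents (B, C_lat, h, w) -> visual tokens (B, h*w, D)."""
    def __init__(self, lat_channels=192, embed_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(lat_channels, embed_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
        )

    def forward(self, latents):
        x = self.proj(latents)                    # (B, D, h, w)
        return x.flatten(2).transpose(1, 2)       # (B, h*w, D)

def surrogate_loss(adapted_tokens, frozen_partial_encoder, image):
    """Match adapted tokens to features from the frozen *partial* visual encoder.

    Only this small slice of the MLLM is evaluated; the LLM itself is never touched.
    The MSE form is an assumption made here for illustration.
    """
    with torch.no_grad():
        target = frozen_partial_encoder(image)    # reference features, no gradients
    return nn.functional.mse_loss(adapted_tokens, target)

if __name__ == "__main__":
    B, C_lat, h, w, D = 2, 192, 16, 16, 1024
    latents = torch.randn(B, C_lat, h, w)         # would come from the neural codec's encoder
    image = torch.randn(B, 3, 224, 224)           # the uncompressed source image
    frozen_encoder = lambda img: torch.randn(B, h * w, D)  # placeholder for part of the MLLM's visual encoder

    neck = TransformNeck(C_lat, D)
    loss = surrogate_loss(neck(latents), frozen_encoder, image)
    loss.backward()                               # gradients reach only the transform-neck
```

Under this setup only the transform-neck receives gradients; in the paper's other scenarios the codec may also be updated, but the downstream MLLM itself stays out of training.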
Related papers
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding.
Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
- Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion [40.56646959926701]
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models.
Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders.
We introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs.
arXiv Detail & Related papers (2024-12-02T09:02:28Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning deep features from medical images of different modalities.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs; we then use the first adapter to adjust these tokens and apply LoRA to fine-tune the weights of the pre-trained LLMs.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task; a rough structural sketch of this pipeline is given after this entry.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
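The LLM-Morph summary above outlines a multi-stage pipeline: CNN encoder, a first adapter, a LoRA-tuned pre-trained LLM, and four further adapters producing multi-scale deformation fields. The following is a rough structural sketch of how such a pipeline could be wired; the toy Transformer standing in for the LoRA-tuned LLM, the module sizes, and the simple flow heads are all assumptions made here for illustration, not details from the paper.

```python
# Rough structural sketch of an LLM-Morph-style pipeline (illustrative only):
# CNN encoder -> first adapter -> (LoRA-tuned) LLM -> four adapters -> multi-scale
# deformation fields. A tiny TransformerEncoder stands in for the pre-trained LLM.
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, 3, stride=2, padding=1), nn.ReLU(),   # moving + fixed image as 2 channels
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, pair):                       # (B, 2, H, W)
        f = self.net(pair)                         # (B, dim, H/4, W/4)
        return f.flatten(2).transpose(1, 2), f.shape[-2:]   # tokens (B, N, dim) and spatial size

class DeformationHead(nn.Module):
    """Adapter mapping LLM-encoded tokens back to a 2-D deformation field at one scale."""
    def __init__(self, dim=256, scale=1):
        super().__init__()
        self.scale = scale
        self.to_flow = nn.Conv2d(dim, 2, 3, padding=1)   # 2 channels: (dx, dy)

    def forward(self, tokens, hw):
        h, w = hw
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        if self.scale > 1:
            x = nn.functional.interpolate(x, scale_factor=self.scale,
                                          mode="bilinear", align_corners=False)
        return self.to_flow(x)                     # (B, 2, h*scale, w*scale)

dim = 256
encoder = CNNEncoder(dim)
first_adapter = nn.Linear(dim, dim)                # the "first adapter" adjusting the visual tokens
llm_stub = nn.TransformerEncoder(                  # stand-in for a LoRA-tuned pre-trained LLM
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
heads = nn.ModuleList(DeformationHead(dim, scale=s) for s in (1, 2, 4, 8))  # the "four other adapters"

pair = torch.randn(1, 2, 64, 64)                   # a cross-modal moving/fixed image pair
tokens, hw = encoder(pair)
tokens = llm_stub(first_adapter(tokens))
flows = [head(tokens, hw) for head in heads]       # coarse-to-fine multi-scale deformation fields
print([tuple(f.shape) for f in flows])
```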
- Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid [87.09900996643516]
We introduce a Complementary Image Pyramid (CIP) to mitigate semantic discontinuity during high-resolution image processing.
We also introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing the redundant visual tokens.
Our experiments demonstrate that CIP can consistently enhance the performance across diverse architectures.
arXiv Detail & Related papers (2024-08-04T13:55:58Z)
- SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling.
A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs.
In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex [4.57590454144072]
Recently, there has been a surge in the popularity of pre-trained large language models (LLMs).
This paper proposes a new multi-modal training paradigm, aligned with LLMs, for encoding fMRI activity in the visual cortex.
arXiv Detail & Related papers (2024-01-08T12:30:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.