Related papers: Improving Multi-modal Large Language Model through Boosting Vision Capabilities

Improving Multi-modal Large Language Model through Boosting Vision Capabilities

URL: http://arxiv.org/abs/2410.13733v1
Date: Thu, 17 Oct 2024 16:36:38 GMT
Title: Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Authors: Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li,
Abstract summary: We focus on improving the visual understanding capability for boosting the vision-language models. We propose textbfArcana, a multiModal language model, which introduces two crucial techniques.
Score: 54.344077285545005
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We focus on improving the visual understanding capability for boosting the vision-language models. We propose \textbf{Arcana}, a multiModal language model, which introduces two crucial techniques. First, we present Multimodal LoRA (MM-LoRA), a module designed to enhance the decoder. Unlike traditional language-driven decoders, MM-LoRA consists of two parallel LoRAs -- one for vision and one for language -- each with its own parameters. This disentangled parameters design allows for more specialized learning in each modality and better integration of multimodal information. Second, we introduce the Query Ladder adapter (QLadder) to improve the visual encoder. QLadder employs a learnable ``\textit{ladder}'' structure to deeply aggregates the intermediate representations from the frozen pretrained visual encoder (e.g., CLIP image encoder). This enables the model to learn new and informative visual features, as well as remaining the powerful capabilities of the pretrained visual encoder. These techniques collectively enhance Arcana's visual perception power, enabling it to leverage improved visual information for more accurate and contextually relevant outputs across various multimodal scenarios. Extensive experiments and ablation studies demonstrate the effectiveness and generalization capability of our Arcana. The code and re-annotated data are available at \url{https://arcana-project-page.github.io}.

Related papers

Omni-Video: Democratizing Unified Video Understanding and Generation [13.616454543808798]
This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing.<n>Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders.<n>To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements.
arXiv Detail & Related papers (2025-07-08T16:02:16Z)
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models [54.234657224615354]
Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training.
arXiv Detail & Related papers (2025-01-06T00:39:31Z)
LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models [60.67899965748755]
We present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
arXiv Detail & Related papers (2024-07-27T05:53:37Z)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter. Our method achieves these advancements with 3070% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z)
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities. The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM. Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning of these disparate vision-language tasks. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.