DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2402.14767v1
- Date: Thu, 22 Feb 2024 18:26:02 GMT
- Title: DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
- Authors: Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, Jiaqi Wang
- Abstract summary: We present DualFocus, a framework for integrating macro and micro perspectives within multi-modal large language models (MLLMs).
We demonstrate DualFocus's superiority in balancing detailed examination with holistic insight, significantly reducing hallucination instances in MLLMs.
- Score: 85.4852517178828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present DualFocus, a novel framework for integrating macro and micro
perspectives within multi-modal large language models (MLLMs) to enhance
vision-language task performance. Current MLLMs typically focus solely on
inputs at a predefined resolution, which leads to deficiencies on detailed
questions involving local regions. We introduce a DualFocus mechanism in which
the model first examines the image from a macro perspective, answers the
question, and identifies suitable sub-regions to zoom into for subsequent
micro-perspective analysis. By integrating the answers from both macro and
micro perspectives, the model can address tasks that require global, detailed,
or combined considerations. To endow MLLMs with the DualFocus mechanism, we
curated a tailored dataset derived from Visual Genome (VG) and adapted it to
align with the DualFocus training regimen. Through comparative
studies across different model sizes and benchmarks, we demonstrate DualFocus's
superiority in balancing detailed examination with holistic insight,
significantly reducing hallucination instances in MLLMs and improving their
performance in various vision-language tasks.
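As a rough illustration of the macro-then-micro flow described above, the sketch below (plain Python) runs one answering pass on the full image, asks the model for a question-relevant sub-region, answers again on the cropped view, and then combines the two results. The helper callables and their signatures are hypothetical placeholders rather than the authors' API, and the answer-integration rule is a simplification since the abstract does not specify it.

    # Minimal sketch of a DualFocus-style macro/micro inference flow, assuming
    # hypothetical callables for the underlying MLLM (not the authors' code).
    from typing import Callable, Tuple
    from PIL import Image

    # (image, question) -> answer text
    AnswerFn = Callable[[Image.Image, str], str]
    # (image, question) -> (left, top, right, bottom) pixel box to zoom into
    RegionFn = Callable[[Image.Image, str], Tuple[int, int, int, int]]

    def dual_focus_answer(image_path: str, question: str,
                          answer_question: AnswerFn,
                          propose_zoom_region: RegionFn) -> str:
        image = Image.open(image_path)

        # Macro pass: answer from the full image at the model's default resolution.
        macro_answer = answer_question(image, question)

        # Micro pass: crop the question-relevant sub-region and answer again
        # from the zoomed-in view.
        box = propose_zoom_region(image, question)
        micro_answer = answer_question(image.crop(box), question)

        # The paper integrates both answers; the selection rule is not given in
        # the abstract, so a trivial placeholder choice is used here.
        return micro_answer if micro_answer else macro_answer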
Related papers
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
We introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
We also introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of contrastive learning (CL) and masked image modeling (MIM).
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - MammothModa: Multi-Modal Large Language Model [17.98445238232718]
We introduce MammothModa, yet another multi-modal large language model (MLLM).
MammothModa consistently outperforms the state-of-the-art models, e.g., LLaVA-series, across main real-world visual language benchmarks without bells and whistles.
arXiv Detail & Related papers (2024-06-26T09:17:27Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Building on this joint mixing, we further propose an efficient strategy to better capture the fine-grained appearance of high-resolution images.
We hope our work sheds light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z) - MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
Vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity.
We introduce vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently.
Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks.
arXiv Detail & Related papers (2023-09-14T17:59:17Z) - Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.