EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
- URL: http://arxiv.org/abs/2410.05938v1
- Date: Tue, 8 Oct 2024 11:41:55 GMT
- Title: EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
- Authors: Yifei Xing, Xiangyuan Lan, Ruiping Wang, Dongmei Jiang, Wenjun Huang, Qingfang Zheng, Yaowei Wang,
- Abstract summary: We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information.
Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
- Score: 39.870809905905325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mamba-based architectures have shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLM) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textural latents, negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-model alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code will be provided.
Related papers
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding.
Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation [3.64388407705261]
We propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet.
Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder.
arXiv Detail & Related papers (2024-08-25T06:20:28Z) - MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z) - Instruction-Guided Visual Masking [25.26544571379426]
Instruction-guided Visual Masking (IVM) is a versatile visual grounding model that is compatible with diverse multimodal models.
IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions.
arXiv Detail & Related papers (2024-05-30T07:48:32Z) - FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba [19.761723108363796]
FusionMamba aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks.
The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms.
Experiments show that FusionMamba achieves state-of-the-art performance in a variety of multimodal image fusion tasks as well as downstream experiments.
arXiv Detail & Related papers (2024-04-15T06:37:21Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Large Language Models (MLLMs)
Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding.
We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
arXiv Detail & Related papers (2023-10-13T02:41:55Z) - Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs)
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.