MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks
- URL: http://arxiv.org/abs/2508.07803v1
- Date: Mon, 11 Aug 2025 09:39:16 GMT
- Title: MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks
- Authors: Yushen Xu, Xiaosong Li, Zhenyu Kuang, Xiaoqi Cheng, Haishu Tan, Huafeng Li
- Abstract summary: MambaTrans is a novel multimodal fusion image modality translator. It minimizes detection loss during training and captures long-term dependencies among text, masks, and images. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.
- Score: 6.603164770657262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.
Related papers
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
- FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding. We propose a single transformer model which operates on an extended vocabulary of text and image tokens. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z)
- OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model [8.619958921346184]
Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. We propose OSDM-MReg, a novel multimodal image registration framework based on image-to-image translation. Experiments demonstrate superior accuracy and efficiency across various multimodal registration tasks.
arXiv Detail & Related papers (2025-04-08T13:32:56Z)
- Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation. We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models [44.299894732492696]
Vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare native multimodal VLMs, models trained from scratch on multimodal data to generate both text and images, and non-native multimodal VLMs, models adapted from pre-trained large language models or capable of generating only text, highlighting key differences in information flow. We show that ablating a single token significantly deteriorates image-understanding performance, whereas targeted, token-level interventions reliably steer image semantics and downstream text with fine-grained control.
arXiv Detail & Related papers (2024-12-09T16:39:40Z)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- ImageBind-LLM: Multi-modality Instruction Tuning [70.05191504511188]
ImageBind-LLM is a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. It can respond to audio, 3D point clouds, video, and their embedding-space arithmetic using only image-text alignment training.
arXiv Detail & Related papers (2023-09-07T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.