Related papers: StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation

StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation

URL: http://arxiv.org/abs/2408.01343v1
Date: Fri, 2 Aug 2024 15:41:16 GMT
Title: StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation
Authors: Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li,
Abstract summary: We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers. We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
Score: 63.31007867379312
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, Our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. By leveraging MultiAdapter to propagate multi-scale information across pre-trained encoders during the encoding process, StitchFusion achieves multi-modal visual information integration during encoding. Extensive comparative experiments demonstrate that our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters. Furthermore, the experimental integration of MultiAdapter with existing Feature Fusion Modules (FFMs) highlights their complementary nature. Our code is available at StitchFusion_repo.

Related papers

FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding.<n>We propose a single transformer model which operates on an extended vocabulary of text and image tokens.<n>We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z)
M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification [23.322598623627222]
M$3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion. We introduce CLIP-driven modality-specific adapters to achieve a comprehensive semantic understanding of different modalities. Experiments have shown that M$3$amba has an average performance improvement of at least 5.98% compared with the state-of-the-art methods.
arXiv Detail & Related papers (2025-03-09T05:06:47Z)
Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion. We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics. We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer) It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features. Compared to other multimodal segmentation techniques, our method outperforms by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z)
MMSFormer: Multimodal Transformer for Material and Semantic Segmentation [16.17270247327955]
We propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal TransFormer (MMSFormer) that incorporates the proposed fusion strategy. MMSFormer outperforms current state-of-the-art models on three different datasets.
arXiv Detail & Related papers (2023-09-07T20:07:57Z)
Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge. MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network. A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features. The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.