Rethinking Early-Fusion Strategies for Improved Multimodal Image Segmentation
- URL: http://arxiv.org/abs/2501.10958v1
- Date: Sun, 19 Jan 2025 06:16:45 GMT
- Title: Rethinking Early-Fusion Strategies for Improved Multimodal Image Segmentation
- Authors: Zhengwen Shen, Yulian Li, Han Zhang, Yuchen Weng, Jun Wang
- Abstract summary: We propose a novel multimodal fusion network (EFNet) based on an early fusion strategy and a simple yet effective feature clustering method for efficient RGB-T semantic segmentation.
We validate the effectiveness of our method on different datasets and outperform previous state-of-the-art methods with fewer parameters and less computation.
- Score: 7.757018983487103
- License:
- Abstract: RGB and thermal image fusion has great potential to improve semantic segmentation in low-illumination conditions. Existing methods typically employ a two-branch encoder framework for multimodal feature extraction and design complicated feature fusion strategies for multimodal semantic segmentation. However, these methods require massive parameter updates and computational effort during feature extraction and fusion. To address this issue, we propose a novel multimodal fusion network (EFNet) based on an early fusion strategy and a simple yet effective feature clustering method for efficient RGB-T semantic segmentation. In addition, we propose a lightweight and efficient multi-scale feature aggregation decoder based on Euclidean distance. We validate the effectiveness of our method on different datasets and outperform previous state-of-the-art methods with fewer parameters and less computation.
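The early-fusion idea above, combining RGB and thermal at the input so a single encoder branch suffices, can be illustrated with a minimal PyTorch sketch. The module layout, the clustering step, and all names below are illustrative assumptions rather than EFNet's actual implementation; only the two ideas named in the abstract (input-level fusion and Euclidean-distance feature assignment) are taken from it.

```python
import torch
import torch.nn as nn

class EarlyFusionStem(nn.Module):
    """Illustrative early-fusion stem: RGB (3ch) and thermal (1ch) are
    concatenated at the input, so one shared encoder branch suffices."""
    def __init__(self, out_channels=64):
        super().__init__()
        # a single shared stem instead of two modality-specific encoders
        self.stem = nn.Sequential(
            nn.Conv2d(3 + 1, out_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb, thermal):
        x = torch.cat([rgb, thermal], dim=1)  # fuse before any feature extraction
        return self.stem(x)

def euclidean_cluster_assign(features, centers):
    """Assign each pixel feature to its nearest cluster center by Euclidean
    distance (an illustrative stand-in for the paper's feature-clustering /
    distance-based aggregation idea)."""
    b, c, h, w = features.shape
    flat = features.permute(0, 2, 3, 1).reshape(b, h * w, c)  # (B, HW, C)
    dists = torch.cdist(flat, centers.expand(b, -1, -1))      # (B, HW, K)
    return dists.argmin(dim=-1).reshape(b, h, w)              # hard assignment map

rgb = torch.randn(2, 3, 128, 160)
thermal = torch.randn(2, 1, 128, 160)
feats = EarlyFusionStem()(rgb, thermal)
centers = torch.randn(9, 64)  # e.g. one center per semantic class
print(feats.shape, euclidean_cluster_assign(feats, centers).shape)
```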
Related papers
- MAGIC++: Efficient and Resilient Modality-Agnostic Semantic Segmentation via Hierarchical Modality Selection [20.584588303521496]
We introduce the MAGIC++ framework, which comprises two key plug-and-play modules for effective multi-modal fusion and hierarchical modality selection.
Our method achieves state-of-the-art performance on both real-world and synthetic benchmarks.
Our method is superior in the novel modality-agnostic setting, where it outperforms prior arts by a large margin.
arXiv Detail & Related papers (2024-12-22T06:12:03Z)
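MAGIC++'s hierarchical modality selection is only named, not specified, in the snippet above. The sketch below shows one plausible reading, assuming a learned scorer that ranks the available modality features and fuses the top-ranked ones; every name and the scoring rule are assumptions.

```python
import torch
import torch.nn as nn

class ModalitySelectFuse(nn.Module):
    """Hypothetical modality-agnostic fusion: score each incoming modality
    feature map, keep the top-k, and average them. Loosely inspired by the
    hierarchical modality selection idea; not the paper's actual module."""
    def __init__(self, channels, keep=2):
        super().__init__()
        self.keep = keep
        # a tiny scorer: global-pooled features -> scalar importance
        self.scorer = nn.Linear(channels, 1)

    def forward(self, feats):  # feats: list of (B, C, H, W), one per modality
        stacked = torch.stack(feats, dim=1)          # (B, M, C, H, W)
        pooled = stacked.mean(dim=(-2, -1))          # (B, M, C)
        scores = self.scorer(pooled).squeeze(-1)     # (B, M)
        topk = scores.topk(self.keep, dim=1).indices # (B, k)
        idx = topk[:, :, None, None, None].expand(-1, -1, *stacked.shape[2:])
        selected = stacked.gather(1, idx)            # (B, k, C, H, W)
        return selected.mean(dim=1)                  # fuse by averaging

fuse = ModalitySelectFuse(channels=32, keep=2)
mods = [torch.randn(2, 32, 16, 16) for _ in range(4)]  # e.g. RGB/depth/LiDAR/event
print(fuse(mods).shape)  # torch.Size([2, 32, 16, 16])
```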
- Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation [7.797154022794006]
Recent endeavors regard the RGB modality as the center and the others as auxiliary, yielding an asymmetric architecture with two branches.
We propose a novel method, named MAGIC, that can be flexibly paired with various backbones, ranging from compact to high-performance models.
Our method achieves state-of-the-art performance while reducing the model parameters by 60%.
arXiv Detail & Related papers (2024-07-16T03:19:59Z)
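The "centering the value of every modality" framing suggests a symmetric design with no privileged RGB branch. A minimal sketch of one such symmetric fusion, an assumed reading rather than MAGIC's actual module, follows:

```python
import torch
import torch.nn as nn

class SymmetricFusion(nn.Module):
    """Hypothetical symmetric fusion: every modality passes through the SAME
    projection (no privileged RGB branch) and contributes via a softmax-
    weighted sum, so any subset of modalities can be supplied."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # shared across modalities
        self.weight = nn.Conv2d(channels, 1, kernel_size=1)       # per-pixel importance

    def forward(self, feats):  # list of (B, C, H, W), any length/order
        projected = torch.stack([self.proj(f) for f in feats], dim=1)  # (B, M, C, H, W)
        logits = torch.stack([self.weight(f) for f in feats], dim=1)   # (B, M, 1, H, W)
        attn = logits.softmax(dim=1)
        return (attn * projected).sum(dim=1)  # output shape is modality-count independent

fuse = SymmetricFusion(channels=32)
print(fuse([torch.randn(1, 32, 8, 8) for _ in range(3)]).shape)
```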
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, attaining state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
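How the textual semantics guide the fusion is not specified above. One common conditioning pattern, used here purely as an illustrative assumption, is FiLM-style modulation of the fused infrared-visible features by a text embedding:

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    """Hypothetical text-guided fusion: concatenate IR and visible features,
    then scale/shift them with parameters predicted from a text embedding
    (FiLM-style conditioning). Illustrative only; not the paper's design."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.film = nn.Linear(text_dim, 2 * channels)  # predicts (gamma, beta)

    def forward(self, ir, vis, text_emb):
        fused = self.mix(torch.cat([ir, vis], dim=1))       # (B, C, H, W)
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)  # each (B, C)
        return gamma[:, :, None, None] * fused + beta[:, :, None, None]

fuse = TextGuidedFusion(channels=32, text_dim=512)
out = fuse(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16),
           torch.randn(1, 512))  # e.g. a CLIP-style sentence embedding
print(out.shape)
```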
- ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection [25.66305300362193]
A novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction.
This framework enhances the discriminability of object features through the query-guided cross-attention mechanism.
The proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios.
arXiv Detail & Related papers (2023-08-15T00:02:10Z)
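The query-guided cross-attention idea can be sketched with standard multi-head attention, where each modality's tokens query the other's. This is a generic reading, not ICAFusion's exact iterative, parameter-shared design:

```python
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    """Sketch of dual cross-attention between two modality feature streams:
    RGB tokens attend to thermal tokens and vice versa, then the refined
    streams are merged with residual connections."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.rgb_from_thermal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, thermal_tokens):  # (B, N, C) each
        # each stream acts as the query; the other modality supplies keys/values
        r, _ = self.rgb_from_thermal(rgb_tokens, thermal_tokens, thermal_tokens)
        t, _ = self.thermal_from_rgb(thermal_tokens, rgb_tokens, rgb_tokens)
        return (rgb_tokens + r) + (thermal_tokens + t)  # residual + merge

fuse = DualCrossAttentionFusion(dim=64)
print(fuse(torch.randn(2, 100, 64), torch.randn(2, 100, 64)).shape)
```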
- A Comparative Assessment of Multi-view fusion learning for Crop Classification [3.883984493622102]
This work assesses different fusion strategies for crop classification in the CropHarvest dataset.
We present a comparison of multi-view fusion methods for three different datasets and show that, depending on the test region, different methods obtain the best performance.
arXiv Detail & Related papers (2023-08-10T08:03:58Z)
- Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z)
- Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z)
- Complementary Random Masking for RGB-Thermal Semantic Segmentation [63.93784265195356]
RGB-thermal semantic segmentation is a potential solution to achieve reliable semantic scene understanding in adverse weather and lighting conditions.
This paper proposes 1) a complementary random masking strategy of RGB-T images and 2) self-distillation loss between clean and masked input modalities.
We achieve state-of-the-art performance over three RGB-T semantic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-30T13:57:21Z)
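The complementary masking strategy, hiding a patch in one modality while keeping it visible in the other, is straightforward to sketch; the patch size and masking ratio below are assumptions:

```python
import torch

def complementary_masks(batch, height, width, patch=16, ratio=0.5):
    """Sample a random patch mask for RGB and give thermal its complement,
    so every patch is visible in exactly one modality. A sketch of the
    complementary random masking idea (patch size/ratio assumed)."""
    gh, gw = height // patch, width // patch
    keep_rgb = (torch.rand(batch, 1, gh, gw) > ratio).float()
    keep_thermal = 1.0 - keep_rgb  # exact complement
    up = lambda m: m.repeat_interleave(patch, -2).repeat_interleave(patch, -1)
    return up(keep_rgb), up(keep_thermal)

rgb = torch.randn(2, 3, 64, 64)
thermal = torch.randn(2, 1, 64, 64)
m_rgb, m_t = complementary_masks(2, 64, 64)
masked_rgb, masked_thermal = rgb * m_rgb, thermal * m_t
# a self-distillation loss would then pull predictions on the masked
# inputs toward predictions on the clean inputs
print(m_rgb.shape, (m_rgb + m_t).unique())  # masks sum to 1 everywhere
```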
- Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image region described by a natural language expression.
We develop an algorithm that progressively shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
We further propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
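The trick described above, sharing encoder weights while keeping modality-specific batch normalization, is simple to sketch; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class SharedConvModalityBN(nn.Module):
    """Shared conv weights with modality-specific BatchNorm layers, as the
    summary above describes: each modality gets its own normalization
    statistics while everything else is shared."""
    def __init__(self, in_ch, out_ch, num_modalities=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # shared
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(num_modalities))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, modality: int):
        # same filters for every modality, separate BN per modality
        return self.act(self.bns[modality](self.conv(x)))

layer = SharedConvModalityBN(3, 16)
rgb_feat = layer(torch.randn(2, 3, 32, 32), modality=0)
depth_feat = layer(torch.randn(2, 3, 32, 32), modality=1)
print(rgb_feat.shape, depth_feat.shape)
```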
- MPI: Multi-receptive and Parallel Integration for Salient Object Detection [17.32228882721628]
The semantic representation of deep features is essential for image context understanding.
In this paper, a novel method called MPI is proposed for salient object detection.
The proposed method outperforms state-of-the-art methods under different evaluation metrics.
arXiv Detail & Related papers (2021-08-08T12:01:44Z)
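"Multi-receptive and parallel integration" suggests parallel branches with different receptive fields whose outputs are merged. A generic sketch using dilated convolutions, an assumed reading of MPI's block rather than its exact design, follows:

```python
import torch
import torch.nn as nn

class MultiReceptiveBlock(nn.Module):
    """Generic multi-receptive parallel block: branches with different
    dilation rates see different context sizes and are integrated by a
    1x1 conv."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.integrate = nn.Conv2d(len(dilations) * channels, channels, kernel_size=1)

    def forward(self, x):
        # run receptive-field branches in parallel, then merge
        return self.integrate(torch.cat([b(x) for b in self.branches], dim=1))

block = MultiReceptiveBlock(channels=32)
print(block(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 32, 40, 40])
```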