Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well
- URL: http://arxiv.org/abs/2502.14471v2
- Date: Mon, 19 May 2025 04:42:52 GMT
- Title: Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well
- Authors: Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li
- Abstract summary: MultiCOS is a novel framework that effectively leverages diverse data modalities to improve segmentation performance. BFSer outperforms existing multimodal baselines with both real and pseudo-modal data.
- Score: 23.460400679372714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camouflaged Object Segmentation (COS) remains challenging because camouflaged objects exhibit only subtle visual differences from their backgrounds and single-modality RGB methods provide limited cues, leading researchers to explore multimodal data to improve segmentation accuracy. In this work, we present MultiCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. MultiCOS comprises two modules: Bi-space Fusion Segmentor (BFSer), which employs state-space and latent-space fusion mechanisms to integrate cross-modal features within a shared representation and a fusion-feedback mechanism to refine context-specific features, and Cross-modal Knowledge Learner (CKLer), which leverages external multimodal datasets to generate pseudo-modal inputs and establish cross-modal semantic associations, transferring knowledge to COS models when real multimodal pairs are missing. When real multimodal COS data are unavailable, CKLer yields additional segmentation gains using only non-COS multimodal sources. Experiments on standard COS benchmarks show that BFSer outperforms existing multimodal baselines with both real and pseudo-modal data. Code will be released at https://github.com/cnyvfang/MultiCOS.
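To make the fusion idea above concrete, the sketch below shows a generic latent-space cross-modal fusion block with a simple feedback gate, written in PyTorch. It is only a hedged illustration of the kind of mechanism the abstract describes: the module name, tensor shapes, and the feedback design are assumptions, not the authors' BFSer implementation. Likewise, a CKLer-style pseudo-modality could be, for example, a depth map predicted from the RGB image by an off-the-shelf estimator, but that too is an assumption about the paper's details.

```python
# Hedged illustration only: a generic latent-space cross-modal fusion block
# with a simple feedback gate. It is NOT the authors' BFSer; module names,
# tensor shapes, and the feedback design are assumptions for this sketch.
import torch
import torch.nn as nn

class LatentFusionSketch(nn.Module):
    def __init__(self, dim_rgb=256, dim_aux=256, dim_latent=256, heads=8):
        super().__init__()
        # Project each modality into a shared latent space.
        self.proj_rgb = nn.Linear(dim_rgb, dim_latent)
        self.proj_aux = nn.Linear(dim_aux, dim_latent)
        # RGB tokens query the auxiliary modality (e.g., a pseudo-depth map).
        self.cross_attn = nn.MultiheadAttention(dim_latent, heads, batch_first=True)
        # A coarse foreground score gates the fused tokens, loosely mimicking
        # a fusion-feedback refinement loop.
        self.coarse_head = nn.Linear(dim_latent, 1)
        self.gate = nn.Sequential(nn.Linear(dim_latent + 1, dim_latent), nn.Sigmoid())

    def forward(self, rgb_tokens, aux_tokens):
        # rgb_tokens: (B, N, dim_rgb), aux_tokens: (B, M, dim_aux)
        q = self.proj_rgb(rgb_tokens)
        kv = self.proj_aux(aux_tokens)
        fused, _ = self.cross_attn(q, kv, kv)            # (B, N, dim_latent)
        coarse = torch.sigmoid(self.coarse_head(fused))  # (B, N, 1) coarse score
        g = self.gate(torch.cat([fused, coarse], dim=-1))
        return g * fused + (1.0 - g) * q                 # refined fused tokens

# Dummy usage:
# fusion = LatentFusionSketch()
# out = fusion(torch.randn(2, 1024, 256), torch.randn(2, 1024, 256))
```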
Related papers
- Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation.
We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z) - Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond [52.486290612938895]
We propose a novel method that leverages semantic knowledge from the Segment Anything Model (SAM) to improve the quality of fusion results and enable downstream task adaptability.
Specifically, we design a Semantic Persistent Attention (SPA) module that efficiently maintains source information via a persistent repository while extracting high-level semantic priors from SAM.
Our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency.
arXiv Detail & Related papers (2025-03-03T06:16:31Z) - MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases. We scale up the modalities from cheap but rich RGB-only matching data by means of generative models. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection [9.913133285133998]
Single-modal object detection tasks often experience performance degradation when encountering diverse scenarios.
In contrast, multimodal object detection tasks can offer more comprehensive information about object features by integrating data from various modalities.
In this paper, we propose COMO, a novel Cross-Mamba interaction and Offset-guided fusion framework.
arXiv Detail & Related papers (2024-12-24T01:14:48Z) - MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z) - Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD).
We develop SAM with semantic feature fusion guidance (Sammese).
Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information; in the mask decoder, a semantic-geometric prompt generation strategy is further introduced.
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - FusionSAM: Visual Multi-Modal Learning with Segment Anything [37.61598617788102]
We introduce the Segment Anything Model (SAM) into multimodal image segmentation for the first time. We propose a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules. Our method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios.
arXiv Detail & Related papers (2024-08-26T02:20:55Z) - StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers. We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Multimodal Collaboration Networks for Geospatial Vehicle Detection in Dense, Occluded, and Large-Scale Events [29.86323896541765]
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene.
Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments.
We propose MuDet, a multimodal collaboration network for dense and occluded vehicle detection.
arXiv Detail & Related papers (2024-05-14T00:51:15Z) - Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer).
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features; a generic sketch of this kind of bidirectional cross-attention appears after this list.
Compared to other multimodal segmentation techniques, our method outperforms them by margins of 2.83 and 4.23, respectively.
arXiv Detail & Related papers (2024-04-25T07:21:14Z) - Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z) - GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition [37.12407597998884]
Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems since it can provide empathetic services.
In multimodal ERC, Graph Neural Networks (GNNs) are capable of extracting both long-distance contextual information and inter-modal interactive information.
We present a directed Graph based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information.
arXiv Detail & Related papers (2022-07-06T13:56:48Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real-world instance-level retrieval tasks.
We train a more effective cross-modal pretraining model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z)
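Several entries above hinge on cross-modal interaction between two feature streams, most explicitly MicFormer's modality-to-modality querying (referenced in that item). As a hedged, generic illustration of that pattern, and not the implementation of MicFormer or any other paper listed here, a bidirectional cross-attention block might look like the following; all names and dimensions are assumptions.

```python
# Generic bidirectional cross-modal attention (illustration only; not the
# architecture of MicFormer or any other paper in this list).
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.a_queries_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_queries_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a: (B, N, dim) tokens from modality A; feat_b: (B, M, dim) from modality B.
        # Each stream queries the other and keeps a residual connection.
        a_upd, _ = self.a_queries_b(feat_a, feat_b, feat_b)
        b_upd, _ = self.b_queries_a(feat_b, feat_a, feat_a)
        return self.norm_a(feat_a + a_upd), self.norm_b(feat_b + b_upd)

# Dummy usage:
# block = BidirectionalCrossAttention()
# a, b = block(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```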