DFR: A Decompose-Fuse-Reconstruct Framework for Multi-Modal Few-Shot Segmentation
- URL: http://arxiv.org/abs/2507.16736v1
- Date: Tue, 22 Jul 2025 16:21:32 GMT
- Title: DFR: A Decompose-Fuse-Reconstruct Framework for Multi-Modal Few-Shot Segmentation
- Authors: Shuai Chen, Fanman Meng, Xiwei Zhang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
- Abstract summary: DFR (Decompose, Fuse and Reconstruct) is a novel framework that addresses the challenge of effectively utilizing multi-modal guidance in few-shot segmentation.
- Score: 21.536784761515968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR's substantial performance improvements over state-of-the-art methods.
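The abstract describes the tri-modal contrastive fusion only at a conceptual level. As a rough illustration of that idea, the sketch below projects visual, textual, and audio features into a shared embedding space, aligns them with a symmetric InfoNCE-style contrastive loss, and concatenates them into a fused guidance token. All module names, feature dimensions, the specific loss form, and the linear fusion head are assumptions made for this sketch; they are not taken from the paper, whose actual pipeline builds on SAM region proposals and is not specified here.

```python
# Minimal, hypothetical sketch of tri-modal contrastive fusion.
# Dimensions, loss form, and fusion head are illustrative assumptions,
# not the DFR paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriModalContrastiveFuse(nn.Module):
    """Projects visual, textual, and audio features into a shared space,
    aligns them with a symmetric InfoNCE-style loss, and fuses them into
    a single guidance token per sample."""

    def __init__(self, vis_dim=256, txt_dim=512, aud_dim=128,
                 embed_dim=256, temperature=0.07):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, embed_dim)
        self.proj_t = nn.Linear(txt_dim, embed_dim)
        self.proj_a = nn.Linear(aud_dim, embed_dim)
        self.fuse = nn.Linear(3 * embed_dim, embed_dim)
        self.temperature = temperature

    def _nce(self, x, y):
        # Symmetric InfoNCE between two batches of normalized embeddings.
        logits = x @ y.t() / self.temperature
        targets = torch.arange(x.size(0), device=x.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def forward(self, vis, txt, aud):
        v = F.normalize(self.proj_v(vis), dim=-1)
        t = F.normalize(self.proj_t(txt), dim=-1)
        a = F.normalize(self.proj_a(aud), dim=-1)
        # Pairwise alignment keeps the three modalities consistent.
        align_loss = self._nce(v, t) + self._nce(v, a) + self._nce(t, a)
        fused = self.fuse(torch.cat([v, t, a], dim=-1))  # tri-modal fused token
        return fused, align_loss


if __name__ == "__main__":
    model = TriModalContrastiveFuse()
    vis = torch.randn(4, 256)   # e.g. pooled SAM region-proposal features
    txt = torch.randn(4, 512)   # e.g. pooled fine-grained text descriptors
    aud = torch.randn(4, 128)   # e.g. pooled audio context features
    fused, loss = model(vis, txt, aud)
    print(fused.shape, loss.item())
```

In a complete few-shot segmentation pipeline, a fused token and alignment loss of this kind would be combined with the SAM-based decomposition and the dual-path reconstruction stages described in the abstract.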
Related papers
- TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models [123.17643568298116]
We present TAViS, a novel framework that couples the knowledge of multimodal foundation models for cross-modal alignment. Effectively combining these models poses two key challenges: the difficulty of transferring knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only a segmentation loss for supervision. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
arXiv Detail & Related papers (2025-06-13T03:19:47Z)
- IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework for robust multi-modal object re-identification (ReID). Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z)
- Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space.
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z)
- TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis [26.867610944625337]
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities. Past research predominantly focused on improving representation learning techniques and feature fusion strategies. We introduce a Text-oriented Cross-Attention Network (TCAN) emphasizing the predominant role of the text modality in MSA.
arXiv Detail & Related papers (2024-04-06T07:56:09Z)
- FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues [20.587249765287183]
The Feature Swapping Multi-modal Reasoning (FSMR) model is designed to enhance multi-modal reasoning through feature swapping.
FSMR incorporates a multi-modal cross-attention mechanism, facilitating the joint modeling of textual and visual information.
Experiments on the PMR dataset demonstrate FSMR's superiority over state-of-the-art baseline models.
arXiv Detail & Related papers (2024-03-29T07:28:50Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition [23.042478625584653]
We propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN).
arXiv Detail & Related papers (2023-06-18T14:02:20Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluate our method on three recent multimodal datasets and demonstrate its effectiveness in producing high-quality multimodal summaries.
arXiv Detail & Related papers (2022-10-10T14:27:10Z)
- Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR).
arXiv Detail & Related papers (2022-08-16T08:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.