Related papers: CMF: Cascaded Multi-model Fusion for Referring Image Segmentation

URL: http://arxiv.org/abs/2106.08617v1
Date: Wed, 16 Jun 2021 08:18:39 GMT
Title: CMF: Cascaded Multi-model Fusion for Referring Image Segmentation
Authors: Jianhua Yang, Yan Huang, Zhanyu Ma, Liang Wang
Abstract summary: We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
Score: 24.942658173937563
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate two modalities together, while the multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial to localize and segment those objects that have large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multi-scale contextual information and facilitate the alignment of two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
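The abstract's contrast between parallel atrous (dilated) convolutions and a cascaded branch can be made concrete with a toy. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it works on a 1-D list instead of 2-D feature maps, it treats the input as an already-combined visual-linguistic feature, and the function names (atrous_conv1d, parallel_branches, cascaded_branch) as well as the wiring of the cascade are hypothetical. The actual module is in the repository linked above.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D dilated ("atrous") convolution with zero padding; output has the input's length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate  # kernel taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def parallel_branches(fused, kernel, rates):
    """Apply one atrous convolution per rate to the same input, then sum the outputs."""
    outputs = [atrous_conv1d(fused, kernel, r) for r in rates]
    return [sum(vals) for vals in zip(*outputs)]


def cascaded_branch(fused, kernel, rates):
    """Assumed cascade: each rate's output feeds the next, so context accumulates."""
    x = fused
    for r in rates:
        x = atrous_conv1d(x, kernel, r)
    return x


# Toy "fused" feature: a single impulse, so each branch's reach is easy to see.
fused = [0.0] * 4 + [1.0] + [0.0] * 4
kernel = [1.0, 1.0, 1.0]
rates = [1, 2]

parallel = parallel_branches(fused, kernel, rates)
cascade = cascaded_branch(fused, kernel, rates)

print("parallel:", parallel)
print("cascade: ", cascade)
```

In this toy, the summed parallel branches spread the impulse over five positions, while the cascade spreads it over seven. This illustrates, in a simplified setting, how chaining dilation rates can integrate progressively wider context.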

Related papers

Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation [8.48847068018671]
This paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network. It enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the KPS, we design the Multiscale Linear Cross-Attention Module (MLAM), which establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. The KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies.
arXiv Detail & Related papers (2025-09-16T13:26:58Z)
Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation. We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z)
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding [37.27111432020955]
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually.
arXiv Detail & Related papers (2025-02-24T02:41:34Z)
Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion. We propose a Part-Whole Relational Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
FusionSAM: Visual Multi-Modal Learning with Segment Anything [37.61598617788102]
We introduce the Segment Anything Model (SAM) into multimodal image segmentation for the first time. We propose a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules. Our method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios.
arXiv Detail & Related papers (2024-08-26T02:20:55Z)
Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields. Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion. We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
RISAM: Referring Image Segmentation via Mutual-Aware Attention Features [13.64992652002458]
Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. We propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism.
arXiv Detail & Related papers (2023-11-27T11:24:25Z)
Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation. We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z)
Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder. We also propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network. A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features. The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
Comprehensive Multi-Modal Interactions for Referring Image Segmentation [7.064383217512461]
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description. To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains. We propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task.
arXiv Detail & Related papers (2021-04-21T08:45:09Z)
Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction. Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)
Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities. We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.