Collaborative Position Reasoning Network for Referring Image
Segmentation
- URL: http://arxiv.org/abs/2401.11775v1
- Date: Mon, 22 Jan 2024 09:11:12 GMT
- Title: Collaborative Position Reasoning Network for Referring Image
Segmentation
- Authors: Jianjian Cao and Beiya Dai and Yulin Li and Xiameng Qin and Jingdong
Wang
- Abstract summary: We propose a novel method to explicitly model entity localization, especially for non-salient entities.
To our knowledge, this is the first work that explicitly focuses on position reasoning modeling.
- Score: 30.414910144177757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given an image and a natural language expression as input, the goal of
referring image segmentation is to segment the foreground masks of the entities
referred to by the expression. Existing methods mainly focus on interactive
learning between vision and language to enhance the multi-modal representations
for global context reasoning. However, predicting directly in pixel-level space
can lead to collapsed positioning and poor segmentation results. The main
challenge lies in explicitly modeling entity localization, especially for
non-salient entities. In this paper, we tackle this problem with a
Collaborative Position Reasoning Network (CPRN) built on two novel modules:
Row-and-Column interactive (RoCo) and Guided Holistic interactive (Holi).
Specifically, RoCo aggregates the visual features into row-wise and
column-wise features corresponding to the two directional axes. It offers
a fine-grained matching behavior that perceives the associations between the
linguistic features and the two decoupled visual features, performing position
reasoning over a hierarchical space. Holi integrates the features of the two
modalities through a cross-modal attention mechanism that suppresses
irrelevant redundancy under the guidance of the positioning information from
RoCo. By incorporating the RoCo and Holi modules, CPRN thus captures the visual
details needed for position reasoning, enabling more accurate segmentation.
To our knowledge, this is the first work that explicitly focuses on position
reasoning modeling. We validate the proposed method on three evaluation
datasets, where it consistently outperforms existing state-of-the-art methods.
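To make the RoCo idea concrete, the following is a minimal PyTorch sketch of a row-and-column interaction inferred from the abstract alone; the pooling choices, attention layout, and the outer-product fusion of the two axis scores are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RoCoSketch(nn.Module):
    """Row-and-Column interaction: language attends to decoupled axis features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per axis (assumed design).
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.row_score = nn.Linear(dim, 1)
        self.col_score = nn.Linear(dim, 1)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; lang: (B, T, C) word features.
        B, C, H, W = vis.shape
        # Decouple the feature map into row- and column-wise descriptors
        # by average pooling along the opposite axis.
        rows = vis.mean(dim=3).transpose(1, 2)          # (B, H, C)
        cols = vis.mean(dim=2).transpose(1, 2)          # (B, W, C)
        # Fine-grained matching: each row / column queries the words.
        rows, _ = self.row_attn(rows, lang, lang)       # (B, H, C)
        cols, _ = self.col_attn(cols, lang, lang)       # (B, W, C)
        # Per-axis relevance, fused into a 2-D positional prior via an
        # outer product (our assumption about the fusion step).
        r = torch.sigmoid(self.row_score(rows))         # (B, H, 1)
        c = torch.sigmoid(self.col_score(cols))         # (B, W, 1)
        return torch.bmm(r, c.transpose(1, 2)).unsqueeze(1)  # (B, 1, H, W)
```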
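A similarly hedged sketch of a Holi-style guided holistic interaction: every pixel attends to the words, and the RoCo prior gates the attended features to suppress irrelevant regions. Gating by element-wise multiplication is our assumption about how the "guidance" is applied.

```python
import torch
import torch.nn as nn

class HoliSketch(nn.Module):
    """Holistic cross-modal attention gated by the RoCo positional prior."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, lang, prior):
        # vis: (B, C, H, W); lang: (B, T, C); prior: (B, 1, H, W) from RoCo.
        B, C, H, W = vis.shape
        pixels = vis.flatten(2).transpose(1, 2)          # (B, HW, C)
        # Holistic interaction: every pixel queries the words.
        fused, _ = self.cross_attn(pixels, lang, lang)   # (B, HW, C)
        # Suppress responses outside the region RoCo deems relevant.
        gate = prior.flatten(2).transpose(1, 2)          # (B, HW, 1)
        out = self.norm(pixels + gate * fused)
        return out.transpose(1, 2).reshape(B, C, H, W)   # (B, C, H, W)
```

In this reading the two modules compose naturally: prior = RoCoSketch(dim)(vis, lang), then HoliSketch(dim)(vis, lang, prior), with the gated features fed to a segmentation head.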
Related papers
- Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [9.109484087832058]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.
To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM).
To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - RISAM: Referring Image Segmentation via Mutual-Aware Attention Features [13.64992652002458]
Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt.
Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding.
We propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism.
arXiv Detail & Related papers (2023-11-27T11:24:25Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Salient Object Ranking with Position-Preserved Attention [44.94722064885407]
We study the Salient Object Ranking (SOR) task, which aims to assign a ranking order to each detected object according to its visual saliency.
We propose the first end-to-end framework for the SOR task and solve it in a multi-task learning fashion.
We also introduce a Position-Preserved Attention (PPA) module tailored for the SOR branch.
arXiv Detail & Related papers (2021-06-09T13:00:05Z) - Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z) - Bidirectional Graph Reasoning Network for Panoptic Segmentation [126.06251745669107]
We introduce a Bidirectional Graph Reasoning Network (BGRNet) to mine the intra-modular and inter-modular relations within and between foreground things and background stuff classes.
BGRNet first constructs image-specific graphs in both instance and semantic segmentation branches that enable flexible reasoning at the proposal level and class level.
arXiv Detail & Related papers (2020-04-14T02:32:10Z) - High-Order Information Matters: Learning Relation and Topology for
Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state-of-the-art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)