Position-Aware Contrastive Alignment for Referring Image Segmentation
- URL: http://arxiv.org/abs/2212.13419v1
- Date: Tue, 27 Dec 2022 09:13:19 GMT
- Title: Position-Aware Contrastive Alignment for Referring Image Segmentation
- Authors: Bo Chen, Zhiwei Hu, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo
- Abstract summary: We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
- Score: 65.16214741785633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims to segment the target object described by a
given natural language expression. Typically, referring expressions contain
complex relationships between the target and its surrounding objects. The main
challenge of this task is to understand the visual and linguistic content
simultaneously and to find the referred object accurately among all instances
in the image. Currently, the most effective way to solve the above problem is
to obtain aligned multi-modal features by computing the correlation between
visual and linguistic feature modalities under the supervision of the
ground-truth mask. However, existing paradigms have difficulty in thoroughly
understanding visual and linguistic content because they cannot directly perceive
information about the surrounding objects related to the target. This prevents
them from learning aligned multi-modal features, which leads to
inaccurate segmentation. To address this issue, we present a position-aware
contrastive alignment network (PCAN) to enhance the alignment of multi-modal
features by guiding the interaction between vision and language through prior
position information. Our PCAN consists of two modules: 1) Position Aware
Module (PAM), which provides position information of all objects related to
natural language descriptions, and 2) Contrastive Language Understanding Module
(CLUM), which enhances multi-modal alignment by comparing the features of the
referred object with those of related objects. Extensive experiments on three
benchmarks demonstrate that our PCAN performs favorably against state-of-the-art
methods. Our code will be made publicly available.
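As a loose illustration of the contrastive comparison attributed to CLUM above (a minimal sketch, not the authors' implementation), the PyTorch snippet below pools candidate object features under soft position priors, standing in for the prior position information a PAM-like module would supply, and applies an InfoNCE-style loss that treats the referred object as the positive and the related objects as negatives. All tensor shapes, function names, and the exact loss form are assumptions.

```python
# Sketch only: position-aware contrastive alignment in the spirit of PCAN's CLUM.
# The heatmaps play the role of PAM-style position priors; nothing here is the
# authors' released code.
import torch
import torch.nn.functional as F


def pool_by_heatmap(feat_map: torch.Tensor, heatmaps: torch.Tensor) -> torch.Tensor:
    """Masked average pooling: one feature vector per candidate object.

    feat_map: (B, C, H, W) visual features.
    heatmaps: (B, K, H, W) soft position priors, one per candidate object.
    returns : (B, K, C) pooled object features.
    """
    weights = heatmaps / (heatmaps.sum(dim=(2, 3), keepdim=True) + 1e-6)
    return torch.einsum("bchw,bkhw->bkc", feat_map, weights)


def position_aware_contrastive_loss(
    feat_map: torch.Tensor,   # (B, C, H, W) visual features
    lang_emb: torch.Tensor,   # (B, C) sentence-level language embedding
    heatmaps: torch.Tensor,   # (B, K, H, W); slot 0 = referred object, rest = related objects
    temperature: float = 0.07,
) -> torch.Tensor:
    obj_feats = F.normalize(pool_by_heatmap(feat_map, heatmaps), dim=-1)  # (B, K, C)
    lang = F.normalize(lang_emb, dim=-1).unsqueeze(1)                     # (B, 1, C)
    logits = (obj_feats * lang).sum(-1) / temperature                     # (B, K)
    # InfoNCE: pull the referred object toward the expression embedding,
    # push the related objects away.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)


if __name__ == "__main__":
    B, C, H, W, K = 2, 256, 30, 30, 4
    loss = position_aware_contrastive_loss(
        torch.randn(B, C, H, W), torch.randn(B, C), torch.rand(B, K, H, W)
    )
    print(loss.item())
```

Heatmap-weighted pooling is only one plausible way to turn position priors into per-object features; the abstract does not commit to this interface.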
Related papers
- LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [56.474856189865946]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension.
We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation.
LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z)
- DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation [1.4154022683679812]
Vision-and-Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues.
We propose a Dual Object Perception-Enhancement Network (DOPE) to address these challenges and improve navigation performance.
arXiv Detail & Related papers (2025-04-30T06:47:13Z)
- Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework.
CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks.
Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z)
- ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives [109.11714588441511]
The Ego-Exo object correspondence task aims to understand object relations across ego-exo perspectives through segmentation.
PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task.
We propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion and SSL-based Cross-View Object Alignment.
arXiv Detail & Related papers (2024-11-28T12:01:03Z)
- Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.
To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM).
To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
- Bi-directional Contextual Attention for 3D Dense Captioning [38.022425401910894]
3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene.
Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object.
We introduce BiCA, a transformer encoder-decoder pipeline that engages in 3D dense captioning for each object with Bi-directional Contextual Attention.
arXiv Detail & Related papers (2024-08-13T06:25:54Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Phrase-Based Affordance Detection via Cyclic Bilateral Interaction [17.022853987801877]
We explore perceiving affordance from a vision-language perspective and consider the challenging phrase-based affordance detection problem.
We propose a cyclic bilateral consistency enhancement network (CBCE-Net) to align language and vision features progressively.
Specifically, the presented CBCE-Net consists of a mutual guided vision-language module that updates the common features of vision and language in a progressive manner, and a cyclic interaction module (CIM) that facilitates the perception of possible interaction with objects in a cyclic manner.
arXiv Detail & Related papers (2022-02-24T13:02:27Z)
- Contrastive Video-Language Segmentation [41.1635597261304]
We focus on the problem of segmenting a certain object referred by a natural language sentence in video content.
We propose to intertwine the visual and linguistic modalities in an explicit way via a contrastive learning objective.
arXiv Detail & Related papers (2021-09-29T01:40:58Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.