Position-Aware Contrastive Alignment for Referring Image Segmentation
- URL: http://arxiv.org/abs/2212.13419v1
- Date: Tue, 27 Dec 2022 09:13:19 GMT
- Title: Position-Aware Contrastive Alignment for Referring Image Segmentation
- Authors: Bo Chen, Zhiwei Hu, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo
- Abstract summary: We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
- Score: 65.16214741785633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims to segment the target object described by a
given natural language expression. Typically, referring expressions contain
complex relationships between the target and its surrounding objects. The main
challenge of this task is to understand the visual and linguistic content
simultaneously and to find the referred object accurately among all instances
in the image. Currently, the most effective way to solve the above problem is
to obtain aligned multi-modal features by computing the correlation between
visual and linguistic feature modalities under the supervision of the
ground-truth mask. However, existing paradigms struggle to thoroughly
understand visual and linguistic content because they cannot directly
perceive information about the surrounding objects related to the target. This
prevents them from learning aligned multi-modal features, which leads to
inaccurate segmentation. To address this issue, we present a position-aware
contrastive alignment network (PCAN) to enhance the alignment of multi-modal
features by guiding the interaction between vision and language through prior
position information. Our PCAN consists of two modules: 1) Position Aware
Module (PAM), which provides position information of all objects related to
natural language descriptions, and 2) Contrastive Language Understanding Module
(CLUM), which enhances multi-modal alignment by comparing the features of the
referred object with those of related objects. Extensive experiments on three
benchmarks demonstrate our PCAN performs favorably against the state-of-the-art
methods. Our code will be made publicly available.
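The abstract describes CLUM as comparing the referred object's features with those of related objects to sharpen multi-modal alignment. The paper does not spell this out at code level here; as a rough, hypothetical sketch (not the authors' actual formulation), an InfoNCE-style objective that treats the referred object as the positive and related objects as negatives could look like:

```python
import numpy as np

def contrastive_alignment_loss(lang_feat, referred_feat, related_feats, tau=0.1):
    """Illustrative InfoNCE-style loss: pull the referred object's feature
    toward the language embedding, push features of related (distractor)
    objects away. Inputs are raw feature vectors; rows are L2-normalized."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = norm(lang_feat)                     # language query, shape (d,)
    pos = norm(referred_feat)               # positive: referred object, (d,)
    negs = norm(np.asarray(related_feats))  # negatives: related objects, (n, d)

    # Similarities scaled by temperature; the positive sits at index 0.
    logits = np.concatenate([[q @ pos], negs @ q]) / tau
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

Minimizing this quantity raises the similarity between the language embedding and the referred object relative to the related objects, which is the qualitative effect the abstract attributes to CLUM.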
Related papers
- Bi-directional Contextual Attention for 3D Dense Captioning [38.022425401910894]
3D dense captioning is a task involving the localization of objects and the generation of descriptions for each object in a 3D scene.
Recent approaches have attempted to incorporate contextual information by modeling relationships with object pairs or aggregating the nearest neighbor features of an object.
We introduce BiCA, a transformer encoder-decoder pipeline that engages in 3D dense captioning for each object with Bi-directional Contextual Attention.
arXiv Detail & Related papers (2024-08-13T06:25:54Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
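Token-level masking and reconstruction, as summarized for BTMAE above, can be illustrated with a deliberately simplified sketch. This is hypothetical scaffolding, not the BTMAE architecture: the learned decoder is replaced by a trivial mean-of-visible-tokens predictor just to show the masking/reconstruction loop.

```python
import numpy as np

def token_mask_reconstruction_step(tokens, mask_ratio=0.3, seed=0):
    """Toy sketch of token-level masked reconstruction: hide a random
    subset of token features and score how well a trivial predictor
    (the mean of the visible tokens) reconstructs the hidden ones."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    n_mask = max(1, int(n * mask_ratio))
    masked_idx = rng.choice(n, size=n_mask, replace=False)

    visible = np.delete(tokens, masked_idx, axis=0)
    prediction = visible.mean(axis=0)           # stand-in for a learned decoder
    target = tokens[masked_idx]
    loss = np.mean((target - prediction) ** 2)  # reconstruction MSE
    return masked_idx, loss
```

In a real model the predictor would be a transformer conditioned on the other modality (image tokens reconstructing language tokens and vice versa), so the reconstruction loss forces cross-modal context to be learned.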
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Phrase-Based Affordance Detection via Cyclic Bilateral Interaction [17.022853987801877]
We explore to perceive affordance from a vision-language perspective and consider the challenging phrase-based affordance detection problem.
We propose a cyclic bilateral consistency enhancement network (CBCE-Net) to align language and vision features progressively.
Specifically, the presented CBCE-Net consists of a mutual guided vision-language module that updates the common features of vision and language in a progressive manner, and a cyclic interaction module (CIM) that facilitates the perception of possible interaction with objects in a cyclic manner.
arXiv Detail & Related papers (2022-02-24T13:02:27Z)
- Contrastive Video-Language Segmentation [41.1635597261304]
We focus on the problem of segmenting a certain object referred by a natural language sentence in video content.
We propose to interwind the visual and linguistic modalities in an explicit way via the contrastive learning objective.
arXiv Detail & Related papers (2021-09-29T01:40:58Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.