Contrastive Video-Language Segmentation
- URL: http://arxiv.org/abs/2109.14131v1
- Date: Wed, 29 Sep 2021 01:40:58 GMT
- Title: Contrastive Video-Language Segmentation
- Authors: Chen Liang, Yawei Luo, Yu Wu and Yi Yang
- Abstract summary: We focus on the problem of segmenting a certain object referred to by a natural language sentence in video content.
We propose to intertwine the visual and linguistic modalities in an explicit way via a contrastive learning objective.
- Score: 41.1635597261304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on the problem of segmenting a certain object referred to by
a natural language sentence in video content, which hinges on formulating a
pinpoint vision-language relation. Existing attempts mainly construct such a
relation implicitly, i.e., via grid-level multi-modal feature fusion, a
paradigm that has proven problematic for distinguishing semantically similar
objects. In this work, we propose to intertwine the visual and linguistic
modalities in an explicit way via a contrastive learning objective, which
directly aligns the referred object with the language description and
separates the unreferred content from it across frames. Moreover, to remedy
the degradation problem, we present two complementary hard instance mining
strategies, i.e., a Language-relevant Channel Filter and Relative Hard
Instance Construction. They encourage the network to exclude visually
distinguishable features and to focus on easily confused objects during
contrastive training. Extensive experiments on two benchmarks, i.e., A2D
Sentences and J-HMDB Sentences, quantitatively demonstrate the
state-of-the-art performance of our method and qualitatively show more
accurate distinction between semantically similar objects than the baselines.
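As a reading aid, here is a minimal sketch (in PyTorch, not the authors' released code) of the kind of cross-modal contrastive objective the abstract describes: an InfoNCE-style loss that pulls the referred object's visual feature toward the sentence embedding while pushing features of unreferred regions, gathered across frames, away from it. The function name, feature shapes, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def video_language_contrastive_loss(referred_feat, unreferred_feats,
                                    text_feat, temperature=0.07):
    """InfoNCE-style video-language contrastive loss (illustrative sketch).

    referred_feat:    (D,)   pooled feature of the referred object
    unreferred_feats: (N, D) features of unreferred regions across frames
    text_feat:        (D,)   embedding of the referring sentence
    """
    # Normalize so dot products become cosine similarities.
    referred = F.normalize(referred_feat, dim=-1)
    negatives = F.normalize(unreferred_feats, dim=-1)
    text = F.normalize(text_feat, dim=-1)

    # One positive pair (text, referred object); N negatives (text, unreferred).
    pos_logit = (text @ referred).unsqueeze(0) / temperature   # shape (1,)
    neg_logits = (negatives @ text) / temperature              # shape (N,)

    # Cross-entropy with the positive at index 0 gives the InfoNCE loss.
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)   # shape (1, N+1)
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits, target)
```

The paper's hard instance mining strategies would act on the negatives in such a loss: Relative Hard Instance Construction, as summarized above, would select or construct `unreferred_feats` that are easy to confuse with the referred object, so the training signal concentrates on the hardest separations.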
Related papers
- Context Propagation from Proposals for Semantic Video Object Segmentation [1.223779595809275]
We propose a novel approach to learning semantic contextual relationships in videos for semantic object segmentation.
Our approach derives semantic contexts from video object proposals, which encode the key evolution of objects and the relationships among objects over the semantic-temporal domain.
arXiv Detail & Related papers (2024-07-08T14:44:18Z)
- Verbs in Action: Improving verb understanding in video-language models [128.87443209118726]
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Phrase-Based Affordance Detection via Cyclic Bilateral Interaction [17.022853987801877]
We explore perceiving affordance from a vision-language perspective and consider the challenging phrase-based affordance detection problem.
We propose a cyclic bilateral consistency enhancement network (CBCE-Net) to align language and vision features progressively.
Specifically, the presented CBCE-Net consists of a mutually guided vision-language module that updates the common features of vision and language progressively, and a cyclic interaction module (CIM) that facilitates the perception of possible interactions with objects in a cyclic manner.
arXiv Detail & Related papers (2022-02-24T13:02:27Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation [47.7867284770227]
Text-based video segmentation is a challenging task that segments out the objects referred to by natural language in videos.
We introduce a novel top-down approach that imitates how humans segment an object with language guidance.
Our method outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-19T09:31:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.